From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <7359f0490704270741i76150820yb43bdae6603e83bc@mail.gmail.com>
Date: Fri, 27 Apr 2007 07:41:10 -0700
From: "Rob Pike"
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
Subject: Re: [9fans] speaking of kenc
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <7871fcf50704270721q25223196rea21e7a64ff8ad58@mail.gmail.com>
Topicbox-Message-UUID: 5152ca7a-ead2-11e9-9d60-3106f5b1d025

\u doesn't add anything useful to the plan 9 c compiler because of
the way its input language is defined. it is useful in c99, because
of the way its input language is defined.

in plan 9, each escape sequence in a string constant represents a
unicode code point. "\x1234" represents a string with a single
character with value 0x1234. but in c99, that is an erroneous string
because each escape sequence represents a byte. thus to represent a
unicode value, one is expected to write out the utf-8 byte sequence.
plan 9's "\x1234" becomes, in c99, "\xe1\x88\xb4".

the \u escapes were created to give plan 9's functionality without
breaking compatibility with the existing implicit meaning of the \x
escapes.

this subject is quite long and involved - where does utf-8 fit in?
how does source encoding interact with internal representation?
output encoding? etc. etc. - but the key point about \u is that it
makes sense in a utf-8 world with standard c and c++. plan 9's c is
very non-standard in this regard. i prefer its design but i don't
find \u to be a bad solution.

there are a number of related notations in the standards pipeline to
deal with some of the other issues, such as forcing utf-8 byte
sequences. the notation is going to get pretty ugly.

-rob
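
a minimal sketch of the mapping described above, assuming an ordinary utf-8 encoder: it takes a code point such as 0x1234 and produces the bytes that a c99 string constant would have to spell out with \x escapes. the names utf8_encode and utf8_escapes are hypothetical, made up for this illustration; they are not part of plan 9, c99, or any standard library.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number written,
 * or 0 if cp is a surrogate or lies above U+10FFFF.
 * (Hypothetical helper, for illustration only.)
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Format the UTF-8 bytes of cp as the text of byte-wise \x escapes.
 * buf must have room for 4 escapes of 4 characters each plus a NUL.
 * Returns the number of bytes in the encoding (0 on error, leaving
 * buf as the empty string).
 */
size_t utf8_escapes(unsigned long cp, char buf[17])
{
    unsigned char bytes[4];
    size_t n = utf8_encode(cp, bytes);
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < n; i++)
        sprintf(buf + 4 * i, "\\x%02x", (unsigned)bytes[i]);
    return n;
}
```

calling utf8_escapes(0x1234, buf) with a 17-byte buf returns 3 and leaves the text \xe1\x88\xb4 in buf, matching the byte sequence given in the message.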