From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 17302 invoked from network); 25 Sep 2002 17:30:22 -0000
Received: from sunsite.dk (130.225.247.90) by ns1.primenet.com.au with SMTP; 25 Sep 2002 17:30:22 -0000
Received: (qmail 1927 invoked by alias); 25 Sep 2002 17:30:07 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 17732
Received: (qmail 1906 invoked from network); 25 Sep 2002 17:30:06 -0000
X-VirusChecked: Checked
cc: zsh-workers@sunsite.dk (Zsh hackers list)
In-reply-to: <10303.1032953780@csr.com>
From: Oliver Kiddle
References: <10303.1032953780@csr.com>
To: Peter Stephenson
Subject: Re: UTF-8 fonts
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <27735.1032974879.1@logica.com>
Date: Wed, 25 Sep 2002 18:29:35 +0100
Sender: kiddleo@logica.com
Message-Id:

On 25 Sep, Peter Stephenson wrote:
> Borzenkov Andrey wrote:
> > Just to make it clear. Is the aim to use UTF-8 internally or to support
> > (arbitrary) multibyte encoding?
>
> The first with as much of the second as we can get in without too much

So is your aim to use UTF-8 internally in all cases or only when it is
the selected character set?

I would have thought it would be easier to just use whatever LC_CTYPE
(the locale's selected encoding) is internally and use the mb* functions
so things work regardless of whether or not LC_CTYPE is a multi-byte
character encoding. I don't know much about other multi-byte character
encodings that can be used for the input/output locale but I had
gathered they at least have the level of compatibility with basic ASCII
that allows you to use ASCII characters in string literals.

To convert everything to UTF-8 internally, you would have to either use
iconv or do messy stuff: the mb* functions deal with whatever LC_CTYPE
is and not UTF-8 (unless that's what LC_CTYPE happens to be of course).

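For comparison, here is a rough sketch of what the iconv route mentioned
above might look like, assuming a POSIX iconv() and a source encoding
name that iconv_open() accepts. The helper name to_utf8 is made up for
illustration; it is not zsh code.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not zsh code): convert the NUL-terminated string
 * `in`, encoded in `from_codeset`, to UTF-8 in `out` (capacity `outsz`).
 * Returns the number of bytes written, excluding the terminating NUL,
 * or (size_t)-1 on any failure.
 */
static size_t to_utf8(const char *from_codeset, const char *in,
                      char *out, size_t outsz)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() wants char **; input is not modified */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsz == 0)
        return (size_t)-1;
    outleft = outsz - 1;      /* reserve room for the terminating NUL */

    cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;    /* this conversion is not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;    /* invalid input or output buffer too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

Calling something like to_utf8(nl_langinfo(CODESET), str, buf, sizeof buf)
after setlocale(LC_CTYPE, "") would pass in the locale's own encoding.
Every string entering or leaving the shell would need such a step.
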

> We are going to assume that bytes without the top-bit set are ASCII, and
> the remainder require mb* handling.

Isn't it easier to just do mb* handling on everything and not go around
checking the top bit? The mb*() functions should do that sort of thing
for us. For example, mbrtowc() can be used to consume one character of a
string, discarding the returned wchar_t. It then worries about the top
bit of each byte, or whatever else the underlying multi-byte character
encoding requires.

> > Impossible. Local names are just arbitrary chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.

As has been mentioned: nl_langinfo(CODESET)

> Read the document at the link I gave which suggests otherwise. However,
> I now think we can in any case leave this to the mb* suite to decide.

Yes, I think we can.

I'm sure you can all use google, but other possibly useful links I had
in my bookmarks are these:

IBM's patches to various GNU stuff:
  https://www-124.ibm.com/developer/opensource/linux/patches/i18n/
IBM article that serves as a basic intro:
  http://www-106.ibm.com/developerworks/library/l-linuni.html
howto:
  http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html

Oliver
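
P.S. To make the mbrtowc() and nl_langinfo(CODESET) points concrete,
here is a minimal sketch, assuming a POSIX system. The names count_chars
and current_codeset are invented for illustration and are not zsh code.
Treating an invalid sequence as a single byte is just one possible
policy.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters in `s` according to the
 * current LC_CTYPE encoding, consuming one character per mbrtowc() call
 * and discarding the decoded wchar_t (first argument NULL).
 */
static size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);            /* initial conversion state */
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or incomplete sequence: assumed policy is to
             * count one byte as one character and reset the state. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            break;                        /* decoded a NUL; stop */
        }
        s += n;
        left -= n;
        count++;
    }
    return count;
}

/*
 * Hypothetical helper: name of the character encoding of the current
 * locale, e.g. "UTF-8". Call setlocale(LC_CTYPE, "") first to pick up
 * the user's environment.
 */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```

Note that nothing here looks at the top bit; the same loop runs
unchanged whether the locale's encoding is single-byte or multi-byte.
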