From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-17732-mason-zsh=primenet.com.au@sunsite.dk>
Received: (qmail 17302 invoked from network); 25 Sep 2002 17:30:22 -0000
Received: from sunsite.dk (130.225.247.90)
  by ns1.primenet.com.au with SMTP; 25 Sep 2002 17:30:22 -0000
Received: (qmail 1927 invoked by alias); 25 Sep 2002 17:30:07 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 17732
Received: (qmail 1906 invoked from network); 25 Sep 2002 17:30:06 -0000
X-VirusChecked: Checked
cc: zsh-workers@sunsite.dk (Zsh hackers list)
In-reply-to: <10303.1032953780@csr.com>
From: Oliver Kiddle <okiddle@yahoo.co.uk>
References: <10303.1032953780@csr.com>
To: Peter Stephenson <pws@csr.com>
Subject: Re: UTF-8 fonts
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <27735.1032974879.1@logica.com>
Date: Wed, 25 Sep 2002 18:29:35 +0100
Sender: kiddleo@logica.com
Message-Id: <E17uFyh-0007Po-00@bimbo.logica.co.uk>

On 25 Sep, Peter Stephenson wrote:
> Borzenkov Andrey wrote:
> > Just to make it clear. Is the aim to use UTF-8 internally or to support
> > (arbitrary) multibyte encoding?
> 
> The first with as much of the second as we can get in without too much

So is your aim to use UTF-8 internally in all cases or only when it is
the selected character set? I would have thought it would be easier to
just use whatever LC_CTYPE (the locale's selected encoding) is
internally and use the mb* functions so things work regardless of
whether or not LC_CTYPE is a multi-byte character encoding. I don't
know much about other multi-byte character encodings that can be used
for the input/output locale but I had gathered they at least have the
level of compatibility with basic ASCII that allows you to use ASCII
characters in string literals. To convert everything to UTF-8
internally, you would have to either use iconv or do messy stuff: the
mb* functions deal with whatever LC_CTYPE is and not UTF-8 (unless
that's what LC_CTYPE happens to be of course).

> We are going to assume that bytes without the top-bit set are ASCII, and
> the remainder require mb* handling.

Isn't it easier to just do mb* handling on everything and not go around
checking the top bit? The mb*() functions should do that sort of stuff
for us. For example, mbrtowc() can be used to consume one character of a
string, discarding the resulting wchar_t. It then takes care of the top
bit of each byte, or whatever else the underlying multi-byte character
encoding requires.
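
A minimal sketch of that idea, not taken from the zsh sources: the helper
names char_len() and count_chars() are made up for illustration, and
treating an invalid or truncated sequence as a single byte is just one
possible policy. Passing NULL as the first argument to mbrtowc() discards
the wide character, so only the byte length is used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: byte length of the first character of s, looking
 * at no more than n bytes.  Returns 0 at a terminating NUL.  Invalid or
 * truncated input is treated as one byte (an assumption made here so the
 * caller always makes progress). */
static size_t char_len(const char *s, size_t n, mbstate_t *st)
{
    assert(s != NULL && st != NULL);
    /* NULL as the first argument: the decoded wchar_t is discarded. */
    size_t r = mbrtowc(NULL, s, n, st);
    if (r == (size_t)-1 || r == (size_t)-2) {
        memset(st, 0, sizeof *st);   /* return to the initial shift state */
        return n > 0 ? 1 : 0;
    }
    return r;
}

/* Hypothetical helper: count characters (not bytes) in a NUL-terminated
 * string, using whatever encoding LC_CTYPE currently selects. */
static size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s), count = 0, len;

    memset(&st, 0, sizeof st);
    while (left > 0 && (len = char_len(s, left, &st)) > 0) {
        s += len;
        left -= len;
        count++;
    }
    return count;
}
```

Nothing here looks at the top bit directly. A caller would run
setlocale(LC_CTYPE, "") once at startup so that mbrtowc() follows the
user's locale; without that call the "C" locale applies.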

> > Impossible. Local names are just arbitrary chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.

As has been mentioned: nl_langinfo(CODESET)
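
For reference, a short sketch of how that call might be used. The
function names current_codeset() and codeset_is_utf8() are hypothetical,
and the exact spelling of the returned name varies between systems, so
the comparison below is only an assumption about two common spellings.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt LC_CTYPE from the environment and return
 * the name of its character encoding as reported by the C library. */
static const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: does the codeset name look like UTF-8?  Only two
 * spellings are checked; a real program may need to accept more. */
static int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}
```

With this, the encoding is read from the C library rather than parsed
out of the locale name.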

> Read the document at the link I gave which suggests otherwise.  However,
> I now think we can in any case leave this to the mb* suite to decide.

Yes, I think we can.

I'm sure you can all use google, but other possibly useful links I had
in my bookmarks are these:

  IBM's patches to various GNU stuff:
    https://www-124.ibm.com/developer/opensource/linux/patches/i18n/
  IBM article that serves as a basic intro:
    http://www-106.ibm.com/developerworks/library/l-linuni.html
  HOWTO:
    http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html

Oliver
