From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 25238 invoked from network); 19 Sep 2002 16:57:37 -0000
Received: from sunsite.dk (130.225.247.90)
  by ns1.primenet.com.au with SMTP; 19 Sep 2002 16:57:37 -0000
Received: (qmail 5745 invoked by alias); 19 Sep 2002 16:57:27 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 17712
Received: (qmail 5730 invoked from network); 19 Sep 2002 16:57:26 -0000
To: zsh-workers@sunsite.dk (Zsh hackers list)
Subject: UTF-8 fonts
Date: Thu, 19 Sep 2002 17:56:57 +0100
Message-ID: <14747.1032454617@csr.com>
From: Peter Stephenson

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
the subject.

My first thought about using UTF-8 instead of eight-bit characters was
that we would have to replace the current `Meta' system.  However, I
don't think we do, since the current system will seamlessly translate
from UTF-8 input to UTF-8 output.

Therefore, all we have to do is modify the shell's internals at the
point where it actually compares characters --- or, more generally,
tries to turn metafied sequences into a single character --- to use the
normal UTF-8 rules.  There may also be some extra places where counting
the length needs changing.

UTF-8 encoded characters are up to 6 bytes long, so either we can do a
direct comparison using 64-bit integers and some bit arithmetic, or we
can just use strncmp.  (I don't fancy relying on internationalisation
support for this, but in principle that's probably the right thing to
do.)

Hence I don't see the necessity for actually decoding UTF-8 into Unicode
at any point, just for deciding the number of bytes.  Not decoding also
avoids problems with overlong encodings (ones which illegally represent
a character using too many bytes): an overlong encoding will always
compare differently to the standard encoding.

Probably we need a configuration option to switch this on or off.

Zle might be a bit more of a problem.
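For the non-zle part, the byte-counting idea can be sketched roughly as
follows.  This is only an illustration, not existing zsh code: the names
utf8_seq_len and utf8_char_equal are made up, and it assumes the strings
have already been unmetafied.  It follows the scheme of up to 6 bytes
described above (RFC 3629 later restricted UTF-8 to 4 bytes).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence introduced
 * by lead byte c, using the original scheme of up to 6 bytes.  Returns 0
 * if c cannot start a sequence (continuation byte, 0xFE or 0xFF). */
size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;   /* 0xxxxxxx: plain ASCII */
    if (c < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (c < 0xE0) return 2;   /* 110xxxxx */
    if (c < 0xF0) return 3;   /* 1110xxxx */
    if (c < 0xF8) return 4;   /* 11110xxx */
    if (c < 0xFC) return 5;   /* 111110xx */
    if (c < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* 0xFE and 0xFF never appear in UTF-8 */
}

/* Hypothetical helper: compare the first character of a and b byte for
 * byte, without decoding either one to a Unicode code point.  Returns 1
 * if they are the same byte sequence, 0 otherwise. */
int utf8_char_equal(const char *a, const char *b)
{
    size_t la = utf8_seq_len((unsigned char)a[0]);
    size_t lb = utf8_seq_len((unsigned char)b[0]);

    if (la == 0 || la != lb)
        return 0;             /* invalid lead byte or different lengths */
    return strncmp(a, b, la) == 0;
}
```

With this, utf8_char_equal("/", "\xC0\xAF") returns 0: the overlong
two-byte form of `/' has a different length from the standard one-byte
form, so it never compares equal, and no decoding is needed to reject it.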
The web page I referred to above gives the hopeful message that all
encoding to/decoding from UTF-8 at the terminal is handled by the
terminal driver.  So for zle we have to worry about things like

- determining whether the terminal is actually in UTF-8 mode, probably
  from the locale
- how UTF-8 encoded characters interfere with meta-bindings.  It may be
  good enough simply not to use these, at least while we work out
  what's what
- reading multi-byte characters --- timeouts and the like
- getting the right length for displaying, deleting, copying etc.
  multi-byte characters.  Apart from counting continuation bytes, we
  may be stuck with using wcwidth for display.  This is a pain because
  it involves explicit wchar_t's, and I have no experience at all with
  these (except that they mess up compilation of otherwise trivial
  string-handling functions).
- all the stuff I've forgotten.

Any comments?

-- 
Peter Stephenson
Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK
Tel: +44 (0)1223 692070

**********************************************************************
The information transmitted is intended only for the person or
entity to which it is addressed and may contain confidential
and/or privileged material.
Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by
persons or entities other than the intended recipient is
prohibited.
If you received this in error, please contact the sender and
delete the material from any computer.
**********************************************************************