From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 11533 invoked from network); 25 Sep 2002 11:01:24 -0000
Received: from sunsite.dk (130.225.247.90) by ns1.primenet.com.au with SMTP; 25 Sep 2002 11:01:24 -0000
Received: (qmail 16 invoked by alias); 25 Sep 2002 11:01:18 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 17729
Received: (qmail 29998 invoked from network); 25 Sep 2002 11:01:17 -0000
Message-ID: <6134254DE87BD411908B00A0C99B044F042E3E33@mowd019a.mow.siemens.ru>
From: Borzenkov Andrey
To: "'Zsh hackers list'"
Subject: RE: UTF-8 fonts
Date: Wed, 25 Sep 2002 15:11:39 +0400
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain

Just to make it clear: is the aim to use UTF-8 internally or to support
(arbitrary) multibyte encodings?

> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
>
> My first thought about using UTF-8 instead of eight bit characters

This sounds like you want to convert input to UTF-8 internally?

> was
> that we would have to replace the current `Meta' system. However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
>
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules. There may also be some extra places where counting
> the length needs changing.
>

You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. This is by definition locale dependent:
collating order is different in different languages even when they use
the same character set. Which means you can use UTF-8 (or, more
generally, any multibyte encoding) only if your current locale supports
it. Which in effect means using the wc* and mb* function suite anyway.
But this also means you cannot assume anything about the current
character set and cannot assume that it is transparent w.r.t. current
string handling in zsh.

> Unicode characters are up to 6 bytes, so either with 64-bit integers we
> can do a direct comparison some bit arithmetic, or we can just use
> strncmp. (I don't fancy relying on internationalisation support for
> this but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes. Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
>

How do you know that your input (and the strings you are processing) is
UTF-8? Besides, the standards do not provide a way to input a multibyte
character; you can only read a wide character.

> Probably we need a configuration option to switch this on or off.
>

Yes, either we rely on standard locale support (and do not care what
character set is being used) or we must provide some OOB means to define
the character set in use.

> Zle might be a bit more of a problem. The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver. So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
> from the locale

Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.

> - how UTF-8 encoded characters interfere with meta-bindings. May be
> good enough simply not to use these, at least while we work out what's
> what
> - reading multi-byte characters --- timeouts and the like

Use standard OS interfaces to read wide characters.

> - getting the right length for displaying, deleting, copying
> etc.
> multi-byte characters. Apart from counting continuation
> bytes, we may be stuck with using wcwidth for display. This is a pain
> because it involves explicit wchar_t's, and I have no experience at
> all with these (except that they mess up compilation of otherwise
> trivial string-handling functions).
> - all the stuff I've forgotten.
>
> Any comments?
>

-andrey
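
For reference, the byte-level approach proposed in the quoted text (decide
the number of bytes from the lead byte, count characters by skipping
continuation bytes, compare with strncmp) could be sketched roughly as
below. The helper names utf8_seq_len, utf8_char_count and utf8_same_char
are hypothetical and are not taken from zsh. The sketch assumes the input
really is UTF-8, which is exactly the assumption questioned above, and it
accepts sequences of up to 6 bytes as stated in the thread. It does no
locale-aware collation or case mapping.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, judged from its
 * lead byte alone.  Returns 0 for a continuation byte (10xxxxxx) or for
 * the invalid lead bytes 0xFE and 0xFF.  Accepts the 1..6 byte forms
 * mentioned in the thread. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return 0;
}

/* Hypothetical helper: count characters in a NUL-terminated string by
 * counting every byte that is not a continuation byte. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: do a and b start with the same encoded character?
 * Purely byte-wise (strncmp), so an overlong encoding never matches the
 * standard encoding of the same code point. */
static int utf8_same_char(const char *a, const char *b)
{
    size_t la = utf8_seq_len((unsigned char)a[0]);
    size_t lb = utf8_seq_len((unsigned char)b[0]);
    return la != 0 && la == lb && strncmp(a, b, la) == 0;
}
```

Note that none of this gives display width or collating order; for those
the locale-based wc* and mb* interfaces discussed above would still be
needed.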