PATCH: multibyte FAQ

zsh-workers

* PATCH: multibyte FAQ
@ 2005-12-14 18:31 Peter Stephenson
  2005-12-14 18:41 ` Peter Stephenson
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Peter Stephenson @ 2005-12-14 18:31 UTC (permalink / raw)
  To: Zsh hackers list

This adds notes on multibyte input to the FAQ.

Your attention is also drawn to the list of systems where multibyte mode
works in INSTALL.  This is all the information I currently have, and
much of it is actually guesswork.

Index: INSTALL
===================================================================
RCS file: /cvsroot/zsh/zsh/INSTALL,v
retrieving revision 1.21
diff -u -r1.21 INSTALL
--- INSTALL	24 Nov 2005 11:46:47 -0000	1.21
+++ INSTALL	14 Dec 2005 18:24:42 -0000
@@ -272,7 +272,16 @@
 --disable-multibyte.  Reports of systems where multibyte support was not
 enabled by default but --enable-multibyte resulted in a usable shell would
 be appreciated.  The developers are not aware of any need to use
---disable-multibyte and this should be reported as a bug.
+--disable-multibyte and this should be reported as a bug.  Currently
+multibyte mode is believed to work automatically on:
+
+  - All(?) current GNU/Linux distributions
+  - All(?) current BSD variants
+  - OS X 10.4.3
+
+and to work when configured with --enable-multibyte on:
+
+  - Solaris 8 and later
 
 The main shell is not yet aware of multibyte characters, so for example the
 length of a scalar parameter will return the number of bytes, not
@@ -281,6 +290,8 @@
 work correctly with characters in multibyte character sets beyond the ASCII
 subset.
 
+See chapter 5 in the FAQ for some notes on multibyte input.
+
 Memory Routines
 ---------------
 
Index: Etc/FAQ.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Etc/FAQ.yo,v
retrieving revision 1.28
diff -u -r1.28 FAQ.yo
--- Etc/FAQ.yo	6 Dec 2005 10:50:37 -0000	1.28
+++ Etc/FAQ.yo	14 Dec 2005 18:24:48 -0000
@@ -43,11 +43,11 @@
 whenman(report(ARG1)(ARG2)(ARG3))\
 whenms(report(ARG1)(ARG2)(ARG3))\
 whensgml(report(ARG1)(ARG2)(ARG3)))
-myreport(Z-Shell Frequently-Asked Questions)(Peter Stephenson)(2005/07/18)
+myreport(Z-Shell Frequently-Asked Questions)(Peter Stephenson)(2005/12/14)
 COMMENT(-- the following are for Usenet and must appear first)\
 description(\
 mydit(Archive-Name:) unix-faq/shell/zsh
-mydit(Last-Modified:) 2005/07/18
+mydit(Last-Modified:) 2005/12/14
 mydit(Submitted-By:) email(pws@pwstephenson.fsnet.co.uk (Peter Stephenson))
 mydit(Posting-Frequency:) Monthly
 mydit(Copyright:) (C) P.W. Stephenson, 1995--2005 (see end of document)
@@ -126,11 +126,18 @@
 4.5. How do I get started with programmable completion?
 4.6. Suppose I want to complete all files during a special completion?
 
-Chapter 5:  The future of zsh
-5.1. What bugs are currently known and unfixed? (Plus recent important changes)
-5.2. Where do I report bugs, get more info / who's working on zsh?
-5.3. What's on the wish-list?
-5.4. Did zsh have problems in the year 2000?
+Chapter 5:  Multibyte input
+
+5.1. What is multibyte input?
+5.2. How does zsh handle multibyte input?
+5.3. How do I ensure multibyte input works on my system?
+5.4. How can I input characters that aren't on my keyboard?
+
+Chapter 6:  The future of zsh
+6.1. What bugs are currently known and unfixed? (Plus recent important changes)
+6.2. Where do I report bugs, get more info / who's working on zsh?
+6.3. What's on the wish-list?
+6.4. Did zsh have problems in the year 2000?
 
 Acknowledgments
 
@@ -1945,6 +1952,175 @@
   such as expansion or approximate completion.
 
 
+chapter(Multibyte input)
+
+sect(What is multibyte input?)
+
+  For a long time computers had a simple idea of a character: each octet
+  (8-bit byte) of text contained one character.  This meant an application
+  could only use 256 characters at once.  The first 128 characters (0 to
+  127) on Unix and similar systems usually corresponded to the ASCII
+  character set, as they still do.  So all other possibilities had to be
+  crammed into the remaining 128.  This was done by picking the appropriate
+  character set for the use you were making.  For example, ISO 8859
+  specified a set of extensions to ASCII for various alphabets.
+
+  This was fine for simple extensions and certain short enough relatives of
+  the Latin alphabet (with no more than a few dozen alphabetic characters),
+  but useless for complex alphabets.  Also, having a different character
+  set for each language is inconvenient: you have to start a new terminal
+  to run the shell with each character set.  So the character set had to be
+  extended.  To cut a long story short, the world has mostly standardised
+  on a character set called Unicode, related to the international standard
+  ISO 10646.  The intention is that this will contain every single
+  character used in all the languages of the world.
+
+  This has far too many characters to fit into a single octet.  What's
+  more, UNIX utilities such as zsh are so used to dealing with ASCII that
+  removing it would cause no end of trouble.  So what happens is this: the
+  128 ASCII characters are kept exactly the same (and they're the same as
+  the first 128 characters of Unicode), but the remaining 128 characters
+  are used to build up any other Unicode character by combining multiple
+  octets together.  The shell doesn't need to interpret these directly; it
+  just needs to ask the system library how many octets form the next
+  character, and if there's a valid character there at all.  (It can also
+  ask the system what width the character takes up on the screen, so that
+  characters no longer need to be exactly one position wide.)
+
+  The way this is done is called UTF-8.  Multibyte encodings of other
+  character sets exist (you might encounter them for Asian character sets);
+  zsh will be able to use any such encoding as long as it contains ASCII as
+  a single-octet subset and the system can provide information about other
+  characters.  However, in the case of Unicode, UTF-8 is the only one you
+  are likely to encounter.
+
+  (In case you're confused: Unicode is the character set, while UTF-8 is
+  an encoding of it.  You might hear about other encodings, such as UCS-2
+  and UCS-4 which are basically the character's index in the character set
+  as a two-octet or four-octet integer.  You might see files encoded this
+  way, for example on Windows, but the shell can't deal directly with text
+  in those formats.)
+
+
+sect(How does zsh handle multibyte input?)
+
+  Until version 4.3, zsh didn't handle multibyte input properly at all.
+  Each octet in a multibyte character would look to the shell like a
+  separate character.  If your terminal handled the character set,
+  characters might appear correct on screen, but trying to edit them would
+  cause all sorts of odd effects.  (It was possible to edit in zsh using
+  single-byte extensions of ASCII such as the ISO 8859 family, however.)
+
+  From version 4.3, multibyte input is handled in the line editor if zsh
+  has been compiled with the appropriate definitions.  This will happen
+  automatically if the compiler defines __STDC_ISO_10646__, which is true
+  for many recent GNU-based systems.  On other systems you must configure
+  zsh with the argument --enable-multibyte to configure.  (The reason for
+  this is that the presence of __STDC_ISO_10646__ ensures all the required
+  library support is present, short-circuiting a large number of
+  configuration tests.)  Explicit use of --enable-multibyte should work on
+  many other recent UNIX systems; if it works on yours, and that's not
+  mentioned in the shell documentation, please report this to
+  zsh-workers@sunsite.dk, and if it doesn't but you can work out why not
+  we'd also be interested in hearing.
+
+  You can test if multibyte handling is compiled into your version of the
+  shell by running:
+  verb(
+    (bindkey -m)
+  )
+  which should output a warning:
+  verb(
+    bindkey: warning: `bindkey -m' disables multibyte support
+  )
+  If it doesn't, you don't have multibyte support in your shell.  The
+  parentheses are there to run the command in a subshell, which protects
+  your interactive shell from the effects being warned about.
+
+  Multibyte strings are not yet handled anywhere else in the shell.  This
+  means, for example, patterns treat multibyte characters as a set of single
+  octets and the ${#var} syntax counts octets, not characters.  There will
+  probably be new syntax to ensure that zsh can work both in its traditional
+  way as well as when interpreting multibyte characters.
+
+
+sect(How do I ensure multibyte input works on my system?)
+
+  Once you have a version of zsh with multibyte support, you need to
+  ensure the environment is correct.  We'll assume you're using UTF-8.
+  Many modern systems may come set up correctly already.  Try one of
+  the editing widgets described in the next section to see.
+
+  There are basically three components.
+
+  itemize(
+   it() The locale.  This describes a whole series of features specific
+      to countries or regions of which the character set is one.  Usually
+      it is controlled by the environment variable tt(LANG) (there are
+      others but this is the one to start with).  You need to find a
+      locale whose name contains mytt(UTF-8).  This will be a variant on
+      your usual locale, which typically indicates the language and
+      country; for example, mine is mytt(en_GB.UTF-8).  Luckily, zsh can
+      complete locale names, so if you have the new completion system
+      loaded you can type mytt(export LANG=) and attempt to complete a
+      suitable locale.  It's the locale that tells the shell to expect the
+      right form of multibyte input.  (However, there's no guarantee that
+      the shell is actually going to get this input: for example, if you
+      edit file names that have been created using a different character
+      set it won't work properly.)
+   it() The terminal emulator.  Those that are supplied with a recent
+      desktop environment, such as gnome-terminal, are likely to have
+      extensive support for localization and may work correctly as soon
+      as they know the locale.
+   it() The font.  If you selected this from a menu in your terminal
+      emulator, there's a good chance it already selected the right
+      character set to go with it.  If you hand-picked an old fashioned
+      X font with a lot of dashes, you need to make sure it ends with
+      the right character encoding, mytt(iso-10646-1) (and not, for
+      example, mytt(iso-8859-1)).  Not all characters will be available
+      in any font, and some fonts may have a more restricted range of
+      Unicode characters than others.
+  )
+
+
+sect(How can I input characters that aren't on my keyboard?)
+
+  Two functions are provided with zsh that help you input characters.
+  As with all editing widgets implemented by functions, you need to
+  mark the function for autoload, create the widget, and, if you are
+  going to use it frequently, bind it to a key sequence.  The
+  following binds tt(insert-composed-char) to F5 on my keyboard:
+  verb(
+    autoload -Uz insert-composed-char
+    zle -N insert-composed-char
+    bindkey '\e[15~' insert-composed-char
+  )
+
+  The two widgets are described in the tt(zshcontrib(1)) manual
+  page, but here is a brief summary:
+
+  tt(insert-composed-char) is followed by two characters that
+  are a mnemonic for a multibyte character.  For example mytt(a:)
+  is a with an umlaut; mytt(cH) is the symbol for hearts on a playing
+  card.  Various accented characters, European and related alphabets,
+  and punctuation and mathematical symbols are available.  The
+  mnemonics are mostly those given by RFC 1345, see
+  url(http://www.faqs.org/rfcs/rfc1345.html)\
+(http://www.faqs.org/rfcs/rfc1345.html).
+
+  tt(insert-unicode-char) is used to input a Unicode character by
+  its hexadecimal number.  This is the number given in the Unicode
+  character charts, see for example \
+url(http://www.unicode.org/charts/)(http://www.unicode.org/charts/).
+  You need to execute the function, then type the hexadecimal number
+  (you can omit any leading zeroes), then execute the function again.
+
+  Both functions can be used without multibyte mode, provided the locale is
+  correct and the character selected exists in the current character set;
+  however, using UTF-8 massively extends the number of valid characters
+  that can be produced.
+
+
 chapter(The future of zsh)
 
 sect(What bugs are currently known and unfixed? (Plus recent \

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070


This message has been scanned for viruses by BlackSpider MailControl - www.blackspider.com


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PATCH: multibyte FAQ
  2005-12-14 18:31 PATCH: multibyte FAQ Peter Stephenson
@ 2005-12-14 18:41 ` Peter Stephenson
  2005-12-15 14:42   ` Peter Stephenson
  2005-12-14 19:25 ` [22076] " Danek Duvall
  2005-12-18 14:14 ` PATCH: multibyte FAQ (MacOS X) Jun T.
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2005-12-14 18:41 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson <pws@csr.com> wrote:
> +and to work when configured with --enable-multibyte on:
> +
> +  - Solaris 8 and later

Hmm... not convinced any more.  It works when the character set is actually
ISO 8859-1, but it seems less keen on en_US.UTF-8, which is the only
English language UTF-8 locale we have installed.  I don't think we
currently have Solaris 9 anywhere.
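
To see which UTF-8 locales a machine actually has installed, the output
of "locale -a" can be filtered.  The following is only a rough sketch:
the helper name utf8_locales is made up for illustration, and it assumes
the codeset suffix is spelled either UTF-8 or utf8.

```shell
# Hypothetical helper: keep only locale names whose codeset is UTF-8.
# Reads locale names on stdin, one per line.
utf8_locales() {
  grep -i -E '\.utf-?8(@|$)'
}

# List the UTF-8 locales installed on this system (may print nothing).
locale -a 2>/dev/null | utf8_locales || true

# Ask which codeset a given locale reports.  en_US.UTF-8 is the locale
# mentioned above and may not exist on your system.
LC_ALL=en_US.UTF-8 locale charmap 2>/dev/null || true
```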

> +      character set to go with it.  If you hand-picked an old fashioned
> +      X font with a lot of dashes, you need to make sure it ends with
> +      the right character encoding, mytt(iso-10646-1) (and not, for
> +      example, mytt(iso-8859-1)).  Not all characters will be available

Those first hyphens shouldn't be there.


-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070





* Re: [22076] PATCH: multibyte FAQ
  2005-12-14 18:31 PATCH: multibyte FAQ Peter Stephenson
  2005-12-14 18:41 ` Peter Stephenson
@ 2005-12-14 19:25 ` Danek Duvall
  2005-12-14 21:09   ` Peter Stephenson
  2005-12-18 14:14 ` PATCH: multibyte FAQ (MacOS X) Jun T.
  2 siblings, 1 reply; 14+ messages in thread
From: Danek Duvall @ 2005-12-14 19:25 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Wed, Dec 14, 2005 at 06:31:03PM +0000, Peter Stephenson wrote:

> +    bindkey: warning: `bindkey -m' disables multibyte support

Noting that I haven't actually tried 4.3 yet, what happens to people who
use bindkey -m to get all the M- bindings?  I use them all the time, and
can't imagine not being able to use them.  Do you have to find new bindings
for those widgets, or does M-whatever still work, but there's a new way of
binding those keystrokes?

Danek


* Re: [22076] PATCH: multibyte FAQ
  2005-12-14 19:25 ` [22076] " Danek Duvall
@ 2005-12-14 21:09   ` Peter Stephenson
  2005-12-16  9:39     ` Danek Duvall
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2005-12-14 21:09 UTC (permalink / raw)
  To: Zsh hackers list

Danek Duvall wrote:
> On Wed, Dec 14, 2005 at 06:31:03PM +0000, Peter Stephenson wrote:
> 
> > +    bindkey: warning: `bindkey -m' disables multibyte support
> 
> Noting that I haven't actually tried 4.3 yet, what happens to people who
> use bindkey -m to get all the M- bindings?  I use them all the time, and
> can't imagine not being able to use them.  Do you have to find new bindings
> for those widgets, or does M-whatever still work, but there's a new way of
> binding those keystrokes?

They still work.  We've managed to retain more or less complete
compatibility with the existing system by the following trick: eight bit
characters are bound to self-insert by default, but the code for
self-insert checks to see if the input character is incomplete.
If it is, it waits for the rest (up to a delay of $KEYTIMEOUT hundredths
of a second).  If you rebind the character it won't act as self-insert,
but it will act exactly the way it always did.

The disadvantage is that you can't input multibyte characters via the
keyboard if they begin with the same meta sequence unless you bind the
more specific sequence starting with that byte (then the simple meta
binding works, but only after a delay of $KEYTIMEOUT).  However, if you
don't have those keys on your keyboard you probably don't care.  The
supplied widgets insert-composed-char and insert-unicode-char only read
ASCII characters, so aren't affected.

You can run
  bindkey -m 2>/dev/null
to silence the warning if you're happy about not having multibyte input.
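
For UTF-8 specifically, the clash can be illustrated with a small
sketch.  With 8-bit meta bindings, M-<x> is the code for <x> with the
eighth bit set, so it coincides with any multibyte character whose first
octet has that value.  The function names below are made up for
illustration, and the sketch assumes a POSIX shell with od available; it
says nothing about multibyte encodings other than UTF-8.

```shell
# Hypothetical helper: print the octets of a string in hex, e.g. "c3 a4".
octets() {
  set -- $(printf '%s' "$1" | od -An -tx1)
  echo "$*"
}

# Hypothetical helper: name the 8-bit meta key whose code equals the first
# octet of the given character, or "none" for a plain ASCII character.
meta_clash() {
  first=$(octets "$1" | cut -d' ' -f1)
  [ -n "$first" ] || return 1
  code=$(( 0x$first ))
  if [ "$code" -lt 128 ]; then
    echo "none"
    return 0
  fi
  # Clear the eighth bit and print the remaining 7-bit character.
  printf 'M-%b\n' "\\0$(printf '%o' $(( code - 128 )))"
}

meta_clash "$(printf '\303\244')"        # a-umlaut, octets c3 a4: prints M-C
meta_clash "$(printf '\342\202\254')"    # euro sign, octets e2 82 ac: prints M-b
```

So in a UTF-8 locale a rebound M-b would get in the way of typing a euro
sign directly, whereas the ASCII-only input read by insert-composed-char
is unaffected.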

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/


* Re: PATCH: multibyte FAQ
  2005-12-14 18:41 ` Peter Stephenson
@ 2005-12-15 14:42   ` Peter Stephenson
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Stephenson @ 2005-12-15 14:42 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson <pws@csr.com> wrote:
> Peter Stephenson <pws@csr.com> wrote:
> > +and to work when configured with --enable-multibyte on:
> > +
> > +  - Solaris 8 and later
> 
> Hmm... not convinced any more.

This is still the case, but I've got slightly more idea about why behaviour
was poor with --enable-multibyte without assuming wchar_t was UCS-4.
We don't include langinfo.h if MULTIBYTE_SUPPORT is defined, which means
CODESET isn't defined, which means we don't use iconv.  (We had a report
about this a while back, I think from Zvi.)
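
At the command line, the rough counterparts of nl_langinfo(CODESET) and
iconv() are "locale charmap" and iconv(1).  The sketch below only
illustrates the kind of lookup and conversion involved; it is not the
code path in the shell, and latin1_to_utf8_hex is a name invented here.

```shell
# Name of the character encoding for the current locale, e.g. UTF-8.
locale charmap 2>/dev/null || true

# Hypothetical helper: convert ISO 8859-1 text on stdin to UTF-8 and
# print the resulting octets in hex.
latin1_to_utf8_hex() {
  set -- $(iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1)
  echo "$*"
}

# a-umlaut is the single octet 0xE4 in ISO 8859-1.
printf '\344' | latin1_to_utf8_hex    # prints: c3 a4
```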

It doesn't seem worth trying too hard to work out whether we need
langinfo.h.  I'll apply the following patch and, following Oliver's
suggestion, back off the other one.

I'm still not getting the resulting multibyte strings handled properly in
Solaris.  I don't know why not or whether they should.  I had a vague
feeling this was basically working...

Index: Src/system.h
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/system.h,v
retrieving revision 1.35
diff -u -r1.35 system.h
--- Src/system.h	28 Oct 2005 17:34:33 -0000	1.35
+++ Src/system.h	15 Dec 2005 14:29:43 -0000
@@ -703,11 +703,10 @@
  */
 # include <wchar.h>
 # include <wctype.h>
-#else
-# ifdef HAVE_LANGINFO_H
-#   include <langinfo.h>
-#   ifdef HAVE_ICONV
-#     include <iconv.h>
-#   endif
-# endif
+#endif
+#ifdef HAVE_LANGINFO_H
+#  include <langinfo.h>
+#  ifdef HAVE_ICONV
+#    include <iconv.h>
+#  endif
 #endif

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



* Re: PATCH: multibyte FAQ
  2005-12-14 21:09   ` Peter Stephenson
@ 2005-12-16  9:39     ` Danek Duvall
  2005-12-16 17:13       ` Bart Schaefer
  0 siblings, 1 reply; 14+ messages in thread
From: Danek Duvall @ 2005-12-16  9:39 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Wed, Dec 14, 2005 at 09:09:28PM +0000, Peter Stephenson wrote:

> The disadvantage is that you can't input multibyte characters via the
> keyboard if they begin with the same meta sequence unless you bind the
> more specific sequence starting with that byte

Is there a simple way to find out which meta sequences block which multibyte
characters, for people who aren't too familiar with UTF-8 encoding?

It might also be worth mentioning that, depending on your terminal
emulator, you might be able to sidestep the problem entirely.  In xterm at
least, you can control whether the meta key sets the eighth bit or sends an
escape character, depending on the values of the eightBitInput and
metaSendsEscape resources.  Other terminals may have similar setups.  There
may still be a certain set of people who won't get exactly what they want,
but it's probably pretty small.

Anyway, thanks for the explanation!

Danek


* Re: PATCH: multibyte FAQ
  2005-12-16  9:39     ` Danek Duvall
@ 2005-12-16 17:13       ` Bart Schaefer
  2005-12-18 19:38         ` Danek Duvall
  0 siblings, 1 reply; 14+ messages in thread
From: Bart Schaefer @ 2005-12-16 17:13 UTC (permalink / raw)
  To: Zsh hackers list

On Dec 16,  1:39am, Danek Duvall wrote:
}
} > The disadvantage is that you can't input multibyte characters via
} > the keyboard if they begin with the same meta sequence unless you
} > bind the more specific sequence starting with that byte
} 
} Is there a simple way to find out which meta sequence block which
} multibyte characters for people who aren't too familiar with UTF-8
} encoding?

No, not really.  The set of multibyte characters depends on the current
locale setting, and there are potentially thousands of them, most of
which will be broken by "bindkey -m".

} It might also be worth mentioning that, depending on your terminal
} emulator, you might be able to sidestep the problem entirely.

It's not *exactly* sidestepping the problem, because you still have to
avoid running "bindkey -m".  It may be an alternative to "bindkey -m".

} In xterm at least, you can control whether the meta key sets the
} eighth bit or sends an escape character depending on the values of the
} eightBitInput and metaSendsEscape resources.

That doesn't have the desired effect for me, because eightBitInput has
to be false, which AFAICT means you can't send multibyte characters
either.  Am I doing something wrong, or is my xterm version too old?


* Re: PATCH: multibyte FAQ (MacOS X)
  2005-12-14 18:31 PATCH: multibyte FAQ Peter Stephenson
  2005-12-14 18:41 ` Peter Stephenson
  2005-12-14 19:25 ` [22076] " Danek Duvall
@ 2005-12-18 14:14 ` Jun T.
  2005-12-18 15:26   ` Andrey Borzenkov
  2005-12-18 19:41   ` Peter Stephenson
  2 siblings, 2 replies; 14+ messages in thread
From: Jun T. @ 2005-12-18 14:14 UTC (permalink / raw)
  To: zsh-workers

I'm just a user (not a worker) of zsh but want to comment on the multibyte support on MacOS X.

At 18:31 +0000 05.12.14, Peter Stephenson wrote:
>+multibyte mode is believed to work automatically on:
>+
>+  - All(?) current GNU/Linux distributions
>+  - All(?) current BSD variants
>+  - OS X 10.4.3

In MacOS X, configure does not enable multibyte by default, because __STDC_ISO_10646__ is not defined. So "./configure; make" gives a zsh without multibyte support, although it supports "\u....".
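
One way to check whether a given compiler and C library define the macro
is to dump the predefined macros after including <wchar.h>.  This is a
sketch only: it is not how configure performs its test, defines_macro is
a made-up helper, and it assumes a gcc- or clang-style compiler that
accepts -dM -E.

```shell
# Hypothetical helper: succeed if the macro dump on stdin defines $1.
defines_macro() {
  grep -q "^#define $1 "
}

# -dM -E asks gcc or clang to print every macro defined after preprocessing.
if printf '#include <wchar.h>\n' | ${CC:-cc} -dM -E - 2>/dev/null |
     defines_macro __STDC_ISO_10646__; then
  echo "__STDC_ISO_10646__ is defined"
else
  echo "__STDC_ISO_10646__ is not defined (or no compiler was found)"
fi
```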

If configured with --enable-multibyte, then I get a zsh which supports multibyte relatively well, but at least two problems remain:

(1) "\u...." (and so insert-composed-char and insert-unicode-char) doesn't work. This can be fixed by a patch proposed in 22085.

(2) In MacOS X, the standard file system is HFS+ which stores filenames in Unicode but in fully decomposed form. This means the filename returned by the completion is also in decomposed form; the filename is displayed correctly, but it can't be edited correctly. And if I go up/down in the history stack the prompt is destroyed.

I can test only on 10.4.3 but guess the situation is the same in 10.3.x/10.4.x.
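
To illustrate what decomposed form means at the octet level, here is a
small sketch comparing the two UTF-8 spellings of a-umlaut.  It assumes
the file system stores something close to Unicode's canonical
decomposition (a base letter followed by a combining mark); the helper
name octets is made up for illustration.

```shell
# Hypothetical helper: print the octets of a string in hex.
octets() {
  set -- $(printf '%s' "$1" | od -An -tx1)
  echo "$*"
}

precomposed=$(printf '\303\244')    # U+00E4, a with diaeresis
decomposed=$(printf 'a\314\210')    # U+0061 followed by combining U+0308

octets "$precomposed"    # prints: c3 a4
octets "$decomposed"     # prints: 61 cc 88

# The two strings look the same on screen but differ byte for byte.
if [ "$precomposed" != "$decomposed" ]; then
  echo "byte-wise different"
fi
```

The decomposed spelling is two characters as far as the line editor is
concerned, the second of which takes up no width on screen.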


* Re: PATCH: multibyte FAQ (MacOS X)
  2005-12-18 14:14 ` PATCH: multibyte FAQ (MacOS X) Jun T.
@ 2005-12-18 15:26   ` Andrey Borzenkov
  2005-12-19  9:08     ` Jun T.
  2005-12-18 19:41   ` Peter Stephenson
  1 sibling, 1 reply; 14+ messages in thread
From: Andrey Borzenkov @ 2005-12-18 15:26 UTC (permalink / raw)
  To: zsh-workers; +Cc: Jun T.


On Sunday 18 December 2005 17:14, Jun T. wrote:
> In MacOS X, configure does not enable multibyte by default, because
> __STDC_ISO_10646__ is not defined. So "./configure; make" gives a zsh
> without multibyte support, although it supports "\u....".
[...]
> (2) In MacOS X, the standard file system is HFS+ which stores filenames in
> Unicode but in fully decomposed form. This means the filename returned by
> the completion is also in decomposed form; the filename is displayed
> correctly, but it can't be edited correctly. And if I go up/down in the
> history stack the prompt is destroyed.
>

I wonder when decomposition happens.  For example, consider a program that
takes a file name as a parameter and compares it with the result of readdir.
Or even worse, one that reads it from stdin or a file.

To be sure - do you mean that e.g. accented characters are internally kept as 
two characters? Does it agree with <http://www.unicode.org/reports/tr15/>?

-andrey



* Re: PATCH: multibyte FAQ
  2005-12-16 17:13       ` Bart Schaefer
@ 2005-12-18 19:38         ` Danek Duvall
  2005-12-18 21:10           ` Bart Schaefer
  0 siblings, 1 reply; 14+ messages in thread
From: Danek Duvall @ 2005-12-18 19:38 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On Fri, Dec 16, 2005 at 05:13:24PM +0000, Bart Schaefer wrote:

> No, not really.  The set of multibyte characters depends on the current
> locale setting, and there are potentially thousands of them, most of
> which will be broken by "bindkey -m".

Oh, of course.  I was thinking only of UTF-8 again.

> } In xterm at least, you can control whether the meta key sets the
> } eighth bit or sends an escape character depending on the values of the
> } eightBitInput and metaSendsEscape resources.
> 
> That doesn't have the desired effect for me, because eightBitInput has
> to be false, which AFAICT means you can't send multibyte characters
> either.  Am I doing something wrong, or is my xterm version too old?

Not sure, as I haven't actually tried it, and I don't have any non-ascii
keys on my keyboard to try it anyway, but from my reading of the xterm man
page (version 205), it might be the case that it can distinguish between
eight-bit characters and key combinations M-<x> (the description in
metaSendsEscape isn't clear, IMHO), and so setting both eightBitInput and
metaSendsEscape to true would allow eight-bit characters to be sent cleanly
to the terminal, and when you type M-<x>, xterm sends ESC-<x>.

You probably also need to set "utf8: true" or "locale: true" or "locale:
utf8" for multibyte character sets to work properly.

I'm sure Thomas could set the record straight here.

Danek


* Re: PATCH: multibyte FAQ (MacOS X)
  2005-12-18 14:14 ` PATCH: multibyte FAQ (MacOS X) Jun T.
  2005-12-18 15:26   ` Andrey Borzenkov
@ 2005-12-18 19:41   ` Peter Stephenson
  2005-12-21 16:15     ` Peter Stephenson
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Stephenson @ 2005-12-18 19:41 UTC (permalink / raw)
  To: zsh-workers

"Jun T." wrote:
> In MacOS X, configure does not enable multibyte by default, because
> __STDC_ISO_10646__ is not defined. So "./configure; make" gives a zsh
> without multibyte support, although it supports "\u....".

Thanks, this is useful.

> (1) "\u...." (and so insert-composed-char and insert-unicode-char) doesn't
> work. This can be fixed by a patch proposed in 22085.

That's now committed.

> (2) In MacOS X, the standard file system is HFS+ which stores filenames in
> Unicode but in fully decomposed form. This means the filename returned by
> the completion is also in decomposed form; the filename is displayed
> correctly, but it can't be edited correctly. And if I go up/down in the
> history stack the prompt is destroyed.

I don't know anything about this, so if anyone else is able to do
something about it, splendid.

With some suggestions about xterm from Phil Pennock together with
looking through the manual, and an attempt on the Sourceforge compile
farm, I now have the following.  FreeBSD was offline; NetBSD worked with
--enable-multibyte; OpenBSD doesn't have wcswidth() so doesn't compile
with --enable-multibyte, though presumably a few configure tests ought to
help matters.

I've still had no luck with Solaris, even Solaris 9.  It didn't help
that the only UTF-8 locale around was ru_RU.UTF-8, but that isn't the
basic problem and I don't know what is; it seems that even obvious
multibyte strings like accented Latin characters aren't being
recognised, even though all the functions are present, the terminal
works fine with other multibyte systems, and LANG is set correctly.

Index: INSTALL
===================================================================
RCS file: /cvsroot/zsh/zsh/INSTALL,v
retrieving revision 1.22
diff -u -r1.22 INSTALL
--- INSTALL	15 Dec 2005 10:38:55 -0000	1.22
+++ INSTALL	18 Dec 2005 19:32:01 -0000
@@ -276,12 +276,14 @@
 multibyte mode is believed to work automatically on:
 
   - All(?) current GNU/Linux distributions
-  - All(?) current BSD variants
-  - OS X 10.4.3
 
 and to work when configured with --enable-multibyte on:
 
-  - Solaris 8 and later
+  - OS X 10.4.3 (problems have been reported with multibyte characters
+    in HFS file names)
+  - NetBSD 2.0.2
+
+Any help with Solaris 8 or 9 would be appreciated.
 
 The main shell is not yet aware of multibyte characters, so for example the
 length of a scalar parameter will return the number of bytes, not
Index: Etc/FAQ.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Etc/FAQ.yo,v
retrieving revision 1.29
diff -u -r1.29 FAQ.yo
--- Etc/FAQ.yo	15 Dec 2005 10:38:56 -0000	1.29
+++ Etc/FAQ.yo	18 Dec 2005 19:32:05 -0000
@@ -2071,7 +2071,25 @@
    it() The terminal emulator.  Those that are supplied with a recent
       desktop environment, such as gnome-terminal, are likely to have
       extensive support for localization and may work correctly as soon
-      as they know the locale.
+      as they know the locale.  You can enable UTF-8 support for
+      tt(xterm) in its application defaults file.  The following are
+      the relevant resources; you don't actually need all of them, as
+      described below.  If you use a mytt(~/.Xdefaults) or
+      mytt(~/.Xresources) file for setting resources, prefix all the lines
+      with mytt(xterm):
+      verb(
+        *wideChars: true
+        *locale: true
+        *utf8: 1
+        *vt100Graphics: true
+      )
+      This turns on support for wide characters (this is enabled by the
+      tt(utf8) resource, too); enables conversions to UTF-8 from other
+      locales (this is the key resource and actually overrides
+      mytt(utf8)); turns on UTF-8 mode (this resource is mostly used to
+      force use of UTF-8 characters if your locale system isn't up to it);
+      and allows certain graphic characters to work even with UTF-8
+      enabled.  (Thanks to Phil Pennock for suggestions.)
    it() The font.  If you selected this from a menu in your terminal
       emulator, there's a good chance it already selected the right
       character set to go with it.  If you hand-picked an old fashioned

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/



* Re: PATCH: multibyte FAQ
  2005-12-18 19:38         ` Danek Duvall
@ 2005-12-18 21:10           ` Bart Schaefer
  0 siblings, 0 replies; 14+ messages in thread
From: Bart Schaefer @ 2005-12-18 21:10 UTC (permalink / raw)
  To: Zsh hackers list

On Dec 18, 11:38am, Danek Duvall wrote:
}
} On Fri, Dec 16, 2005 at 05:13:24PM +0000, Bart Schaefer wrote:
} > That doesn't have the desired effect for me, because eightBitInput has
} > to be false, which AFAICT means you can't send multibyte characters
} > either.  Am I doing something wrong, or is my xterm version too old?
} 
} Not sure, as I haven't actually tried it, and I don't have any non-ascii
} keys on my keyboard to try it anyway, but from my reading of the xterm man
} page (version 205), it might be the case that [...] both eightBitInput and
} metaSendsEscape to true would allow eight-bit characters to be sent cleanly
} to the terminal, and when you type M-<x>, xterm sends ESC-<x>.

When I set them both to true, M-<x> sends an 8-bit character.  When I set
eightBitInput to false and metaSendsEscape to true, M-<x> sends ESC-x.


* Re: PATCH: multibyte FAQ (MacOS X)
  2005-12-18 15:26   ` Andrey Borzenkov
@ 2005-12-19  9:08     ` Jun T.
  0 siblings, 0 replies; 14+ messages in thread
From: Jun T. @ 2005-12-19  9:08 UTC (permalink / raw)
  To: zsh-workers

(I have subscribed to zsh-workers so no need to reply to me)

At 6:26 PM +0300 05.12.18, Andrey Borzenkov wrote:
>I wonder, when decomposition happens?

It seems that open(2) and opendir(2) accept both precomposed and decomposed forms, but they internally convert the filename/dirname into decomposed form. For example,

/* a-umlaut: precomposed */
name[0] = 0xc3; name[1] = 0xa4;
name[2] = 0x00;
fd = open(name,O_CREAT,mode);

has the same effect as

/* a + umlaut: decomposed */
name[0] = 0x61;			/* 'a' */
name[1] = 0xcc; name[2] = 0x88;	/* umlaut */
name[3] = 0x00;
fd = open(name,O_CREAT,mode);

and the created file has a filename in decomposed form.
readdir(2) always returns filenames in decomposed form (and UTF-8 encoding).

If a user inputs a-umlaut from the keyboard, then it is in precomposed form (at least for US and Japanese keyboards). But I think zsh need not convert it into decomposed form. For example,

zsh% echo hello > Xa-umlaut    (two characters, 'X' and 'a-umlaut')

this works fine even if the 'a-umlaut' is in precomposed form. The created file has filename in decomposed form, and if I use filename completion

zsh% cat X<TAB>

then I get

zsh% cat Xa+umlaut

and the a+umlaut is in decomposed form. The decomposed char is displayed correctly both in Apple's Terminal (which runs in Apple's Aqua window system) and in xterm (with Unicode font, of course), but I can't edit the command line correctly.

You can test decomposed char as follows
(I assume insert-unicode-char is bound to ^XU)

zsh% echo ^XU61^XU^XU308^XU
a+umlaut
zsh% (go up in the history stack and try to edit it)

>To be sure - do you mean that e.g. accented characters are internally kept as 
>two characters? Does it agree with <http://www.unicode.org/reports/tr15/>?

How Apple decomposes characters can be found in
http://developer.apple.com/technotes/tn/tn1150table.html

I don't know whether this exactly follows the "Canonical Decomposition (Normalization Form D)" in the Unicode Standard; probably not. 


* Re: PATCH: multibyte FAQ (MacOS X)
  2005-12-18 19:41   ` Peter Stephenson
@ 2005-12-21 16:15     ` Peter Stephenson
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Stephenson @ 2005-12-21 16:15 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> I've still had no luck with Solaris, even Solaris 9.  It didn't help
> that the only UTF-8 locale around was ru_RU.UTF-8, but that isn't the
> basic problem and I don't know what is; it seems that even obvious
> multibyte strings like accented Latin characters aren't being
> recognised, even though all the functions are present, the terminal
> works fine with other multibyte systems, and LANG is set correctly.

On further investigation it seems that when given the first byte of a
multibyte character, mbrtowc() sometimes returns -1 (error) instead of -2
(incomplete).  Reading another character and then passing both to mbrtowc()
worked.  Sometimes later in the line it works as expected, returning -2 for
an initial byte.  It doesn't seem to be tied to the mbstate_t parameter in
an obvious way, indeed it wasn't obvious that was doing anything at all.

I had a go at rewriting getrestchar() to look for other queued bytes, but
that wasn't good enough and I haven't yet had a chance to look at why.
From the silence so far it seems like no one has encountered this before.

I wonder if it's some interaction between the library and gcc.

I won't have a chance to get any further before Christmas.  I'll be away
till the 3rd January.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



end of thread, other threads:[~2005-12-21 16:20 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-14 18:31 PATCH: multibyte FAQ Peter Stephenson
2005-12-14 18:41 ` Peter Stephenson
2005-12-15 14:42   ` Peter Stephenson
2005-12-14 19:25 ` [22076] " Danek Duvall
2005-12-14 21:09   ` Peter Stephenson
2005-12-16  9:39     ` Danek Duvall
2005-12-16 17:13       ` Bart Schaefer
2005-12-18 19:38         ` Danek Duvall
2005-12-18 21:10           ` Bart Schaefer
2005-12-18 14:14 ` PATCH: multibyte FAQ (MacOS X) Jun T.
2005-12-18 15:26   ` Andrey Borzenkov
2005-12-19  9:08     ` Jun T.
2005-12-18 19:41   ` Peter Stephenson
2005-12-21 16:15     ` Peter Stephenson
