From: Peter Stephenson <pws@csr.com>
To: zsh-workers@sunsite.dk (Zsh hackers list)
Subject: PATCH: multibyte FAQ
Date: Wed, 14 Dec 2005 18:31:03 +0000
Message-ID: <200512141831.jBEIV3qQ028002@news01.csr.com>
This adds notes on multibyte input to the FAQ.
Your attention is also drawn to the list of systems where multibyte mode
works in INSTALL. This is all the information I currently have, and
much of it is actually guesswork.
Index: INSTALL
===================================================================
RCS file: /cvsroot/zsh/zsh/INSTALL,v
retrieving revision 1.21
diff -u -r1.21 INSTALL
--- INSTALL 24 Nov 2005 11:46:47 -0000 1.21
+++ INSTALL 14 Dec 2005 18:24:42 -0000
@@ -272,7 +272,16 @@
--disable-multibyte. Reports of systems where multibyte support was not
enabled by default but --enable-multibyte resulted in a usable shell would
be appreciated. The developers are not aware of any need to use
---disable-multibyte and this should be reported as a bug.
+--disable-multibyte and this should be reported as a bug. Currently
+multibyte mode is believed to work automatically on:
+
+ - All(?) current GNU/Linux distributions
+ - All(?) current BSD variants
+ - OS X 10.4.3
+
+and to work when configured with --enable-multibyte on:
+
+ - Solaris 8 and later
The main shell is not yet aware of multibyte characters, so for example the
length of a scalar parameter will return the number of bytes, not
@@ -281,6 +290,8 @@
work correctly with characters in multibyte character sets beyond the ASCII
subset.
+See chapter 5 in the FAQ for some notes on multibyte input.
+
Memory Routines
---------------
Index: Etc/FAQ.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Etc/FAQ.yo,v
retrieving revision 1.28
diff -u -r1.28 FAQ.yo
--- Etc/FAQ.yo 6 Dec 2005 10:50:37 -0000 1.28
+++ Etc/FAQ.yo 14 Dec 2005 18:24:48 -0000
@@ -43,11 +43,11 @@
whenman(report(ARG1)(ARG2)(ARG3))\
whenms(report(ARG1)(ARG2)(ARG3))\
whensgml(report(ARG1)(ARG2)(ARG3)))
-myreport(Z-Shell Frequently-Asked Questions)(Peter Stephenson)(2005/07/18)
+myreport(Z-Shell Frequently-Asked Questions)(Peter Stephenson)(2005/12/14)
COMMENT(-- the following are for Usenet and must appear first)\
description(\
mydit(Archive-Name:) unix-faq/shell/zsh
-mydit(Last-Modified:) 2005/07/18
+mydit(Last-Modified:) 2005/12/14
mydit(Submitted-By:) email(pws@pwstephenson.fsnet.co.uk (Peter Stephenson))
mydit(Posting-Frequency:) Monthly
mydit(Copyright:) (C) P.W. Stephenson, 1995--2005 (see end of document)
@@ -126,11 +126,18 @@
4.5. How do I get started with programmable completion?
4.6. Suppose I want to complete all files during a special completion?
-Chapter 5: The future of zsh
-5.1. What bugs are currently known and unfixed? (Plus recent important changes)
-5.2. Where do I report bugs, get more info / who's working on zsh?
-5.3. What's on the wish-list?
-5.4. Did zsh have problems in the year 2000?
+Chapter 5: Multibyte input
+
+5.1. What is multibyte input?
+5.2. How does zsh handle multibyte input?
+5.3. How do I ensure multibyte input works on my system?
+5.4. How can I input characters that aren't on my keyboard?
+
+Chapter 6: The future of zsh
+6.1. What bugs are currently known and unfixed? (Plus recent important changes)
+6.2. Where do I report bugs, get more info / who's working on zsh?
+6.3. What's on the wish-list?
+6.4. Did zsh have problems in the year 2000?
Acknowledgments
@@ -1945,6 +1952,175 @@
such as expansion or approximate completion.
+chapter(Multibyte input)
+
+sect(What is multibyte input?)
+
+ For a long time computers had a simple idea of a character: each octet
+ (8-bit byte) of text contained one character. This meant an application
+ could only use 256 characters at once. The first 128 characters (0 to
+ 127) on Unix and similar systems usually corresponded to the ASCII
+ character set, as they still do. So all other possibilities had to be
+ crammed into the remaining 128. This was done by picking the appropriate
+ character set for the use you were making. For example, ISO 8859
+ specified a set of extensions to ASCII for various alphabets.
+
+ This was fine for simple extensions and certain short enough relatives of
+ the Latin alphabet (with no more than a few dozen alphabetic characters),
+ but useless for complex alphabets. Also, having a different character
+ set for each language is inconvenient: you have to start a new terminal
+ to run the shell with each character set. So the character set had to be
+ extended. To cut a long story short, the world has mostly standardised
+ on a character set called Unicode, related to the international standard
+ ISO 10646. The intention is that this will contain every single
+ character used in all the languages of the world.
+
+ This has far too many characters to fit into a single octet. What's
+ more, UNIX utilities such as zsh are so used to dealing with ASCII that
+ removing it would cause no end of trouble. So what happens is this: the
+ 128 ASCII characters are kept exactly the same (and they're the same as
+ the first 128 characters of Unicode), but the remaining 128 characters
+ are used to build up any other Unicode character by combining multiple
+ octets together. The shell doesn't need to interpret these directly; it
+ just needs to ask the system library how many octets form the next
+ character, and if there's a valid character there at all. (It can also
+ ask the system what width the character takes up on the screen, so that
+ characters no longer need to be exactly one position wide.)
+
+ The way this is done is called UTF-8. Multibyte encodings of other
+ character sets exist (you might encounter them for Asian character sets);
+ zsh will be able to use any such encoding as long as it contains ASCII as
+ a single-octet subset and the system can provide information about other
+ characters. However, in the case of Unicode, UTF-8 is the only one you
+ are likely to encounter.
+
+ (In case you're confused: Unicode is the character set, while UTF-8 is
+ an encoding of it. You might hear about other encodings, such as UCS-2
+ and UCS-4, which are basically the character's index in the character set
+ as a two-octet or four-octet integer. You might see files encoded this
+ way, for example on Windows, but the shell can't deal directly with text
+ in those formats.)
+
+
+sect(How does zsh handle multibyte input?)
+
+ Until version 4.3, zsh didn't handle multibyte input properly at all.
+ Each octet in a multibyte character would look to the shell like a
+ separate character. If your terminal handled the character set,
+ characters might appear correct on screen, but trying to edit them would
+ cause all sorts of odd effects. (It was possible to edit in zsh using
+ single-byte extensions of ASCII such as the ISO 8859 family, however.)
+
+ From version 4.3, multibyte input is handled in the line editor if zsh
+ has been compiled with the appropriate definitions. This will happen
+ automatically if the compiler defines __STDC_ISO_10646__, which is true
+ for many recent GNU-based systems. On other systems you must pass
+ the argument --enable-multibyte to configure. (The reason for
+ this is that the presence of __STDC_ISO_10646__ ensures all the required
+ library support is present, short-circuiting a large number of
+ configuration tests.) Explicit use of --enable-multibyte should work on
+ many other recent UNIX systems; if it works on yours, and that's not
+ mentioned in the shell documentation, please report this to
+ zsh-workers@sunsite.dk, and if it doesn't but you can work out why not
+ we'd also be interested in hearing.
+
+ You can test if multibyte handling is compiled into your version of the
+ shell by running:
+ verb(
+ (bindkey -m)
+ )
+ which should output a warning:
+ verb(
+ bindkey: warning: `bindkey -m' disables multibyte support
+ )
+ If it doesn't, you don't have multibyte support in your shell. The
+ parentheses are there to run the command in a subshell, which protects
+ your interactive shell from the effects being warned about.
+
+ Multibyte strings are not yet handled anywhere else in the shell. This
+ means, for example, patterns treat multibyte characters as a set of single
+ octets and the ${#var} syntax counts octets, not characters. There will
+ probably be new syntax to ensure that zsh can work both in its traditional
+ way and when interpreting multibyte characters.
+
+
+sect(How do I ensure multibyte input works on my system?)
+
+ Once you have a version of zsh with multibyte support, you need to
+ ensure the environment is correct. We'll assume you're using UTF-8.
+ Many modern systems may come set up correctly already. Try one of
+ the editing widgets described in the next section to see.
+
+ There are basically three components.
+
+ itemize(
+ it() The locale. This describes a whole series of features specific
+ to countries or regions of which the character set is one. Usually
+ it is controlled by the environment variable tt(LANG) (there are
+ others but this is the one to start with). You need to find a
+ locale whose name contains mytt(UTF-8). This will be a variant on
+ your usual locale, which typically indicates the language and
+ country; for example, mine is mytt(en_GB.UTF-8). Luckily, zsh can
+ complete locale names, so if you have the new completion system
+ loaded you can type mytt(export LANG=) and attempt to complete a
+ suitable locale. It's the locale that tells the shell to expect the
+ right form of multibyte input. (However, there's no guarantee that
+ the shell is actually going to get this input: for example, if you
+ edit file names that have been created using a different character
+ set it won't work properly.)
+ it() The terminal emulator. Those that are supplied with a recent
+ desktop environment, such as gnome-terminal, are likely to have
+ extensive support for localization and may work correctly as soon
+ as they know the locale.
+ it() The font. If you selected this from a menu in your terminal
+ emulator, there's a good chance it already selected the right
+ character set to go with it. If you hand-picked an old-fashioned
+ X font with a lot of dashes, you need to make sure it ends with
+ the right character encoding, mytt(iso-10646-1) (and not, for
+ example, mytt(iso-8859-1)). Not all characters will be available
+ in any font, and some fonts may have a more restricted range of
+ Unicode characters than others.
+ )
+
+
+sect(How can I input characters that aren't on my keyboard?)
+
+ Two functions are provided with zsh that help you input characters.
+ As with all editing widgets implemented by functions, you need to
+ mark the function for autoload, create the widget, and, if you are
+ going to use it frequently, bind it to a key sequence. The
+ following binds tt(insert-composed-char) to F5 on my keyboard:
+ verb(
+ autoload -Uz insert-composed-char
+ zle -N insert-composed-char
+ bindkey '\e[15~' insert-composed-char
+ )
+
+ The two widgets are described in the tt(zshcontrib(1)) manual
+ page, but here is a brief summary:
+
+ tt(insert-composed-char) is followed by two characters that
+ are a mnemonic for a multibyte character. For example mytt(a:)
+ is a with an umlaut; mytt(cH) is the symbol for hearts on a playing
+ card. Various accented characters, European and related alphabets,
+ and punctuation and mathematical symbols are available. The
+ mnemonics are mostly those given by RFC 1345; see
+ url(http://www.faqs.org/rfcs/rfc1345.html)\
+(http://www.faqs.org/rfcs/rfc1345.html).
+
+ tt(insert-unicode-char) is used to input a Unicode character by
+ its hexadecimal number. This is the number given in the Unicode
+ character charts, see for example \
+url(http://www.unicode.org/charts/)(http://www.unicode.org/charts/).
+ You need to execute the function, then type the hexadecimal number
+ (you can omit any leading zeroes), then execute the function again.
+
+ Both functions can be used without multibyte mode, provided the locale is
+ correct and the character selected exists in the current character set;
+ however, using UTF-8 massively extends the number of valid characters
+ that can be produced.
+
+
chapter(The future of zsh)
sect(What bugs are currently known and unfixed? (Plus recent \
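
For anyone trying the FAQ text out: the distinction it draws between octets
and characters in UTF-8 can be sketched in portable shell. This is only an
illustration under stated assumptions. The helper names utf8_octets and
utf8_chars are made up here and are not part of zsh or of the patch, and
the character count relies on the input being valid UTF-8, where every
continuation octet lies in the range 0x80 to 0xBF.

```shell
# Hypothetical helpers for illustration; not part of zsh or of the patch.

# Number of octets in a string: wc -c counts bytes regardless of locale.
utf8_octets() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of UTF-8 characters: delete continuation octets (0x80-0xBF,
# octal 200-277) and count what remains, one lead octet per character.
# Assumes the input is valid UTF-8.
utf8_chars() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# "a" followed by a-umlaut (U+00E4), written with octal escapes so that
# this snippet itself stays pure ASCII.
s=$(printf 'a\303\244')

echo "octets:     $(utf8_octets "$s")"
echo "characters: $(utf8_chars "$s")"
```

The first figure is what the FAQ says ${#var} reports in a shell without
multibyte awareness in the main shell; the second is what a
multibyte-aware count would give.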
--
Peter Stephenson <pws@csr.com> Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070