* printf %s in UTF-8 is not POSIX-compliant
From: Vincent Lefevre @ 2008-03-04 1:29 UTC
To: zsh-workers

Hi,

Under UTF-8 locales:

vin:~> zsh-beta -f
vin% emulate sh
vin% printf ".%2s.\n" é
. é.
vin% /usr/bin/printf ".%2s.\n" é
.é.
vin%

As you can see, the zsh printf builtin doesn't behave like the
coreutils printf, and this is zsh which is wrong. Indeed, the
precision is the number of bytes, not the number of characters.

http://www.opengroup.org/onlinepubs/009695399/utilities/printf.html
says (in the extended description) that the "file format notation"
shall be used for the format (and %s isn't an exception).

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap05.html
(file format notation) says:

  s   The argument shall be taken to be a string and bytes from the
      string shall be written until the end of the string or the
      number of bytes indicated by the precision specification of the
      argument is reached. If the precision is omitted from the
      argument, it shall be taken to be infinite, so all bytes up to
      the end of the string shall be written.

Note: ksh93 has the same bug, but not pdksh and bash. But bash may
change its behavior if not under POSIX compatibility, see

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=459413

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Vincent Lefevre @ 2008-03-04 1:37 UTC
To: zsh-workers

I mixed up the field width and the precision, but there's the same
problem:

vin% emulate sh
vin% printf ".%.2s.\n" éabc
.éa.
vin% /usr/bin/printf ".%.2s.\n" éabc
.é.
vin%

and POSIX says:

  field width
    An optional string of decimal digits to specify a minimum field
    width. For an output field, if the converted value has fewer bytes
                                                                 ^^^^^
    than the field width, [...]
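
The two transcripts above come down to the difference between the byte
length and the character length of "é" in UTF-8. A minimal sketch of
that difference, using only standard utilities; the helper names
byte_len and first_bytes are made up for illustration and do not come
from the thread:

```shell
# Byte length of a string: wc -c counts bytes, whatever the locale.
byte_len() {
    # $(( ... )) strips the leading blanks some wc implementations print.
    echo "$(( $(printf '%s' "$1" | wc -c) ))"
}

# Keep only the first N bytes of a string (cut -b selects by byte position).
first_bytes() {
    printf '%s\n' "$2" | cut -b "1-$1"
}

# "é" written as its two UTF-8 bytes, so the example does not depend on the
# encoding of the file it is stored in.
e_acute=$(printf '\303\251')

byte_len "$e_acute"              # prints 2: é already fills a %2s field
byte_len "${e_acute}abc"         # prints 5
first_bytes 2 "${e_acute}abc"    # prints the two bytes of é, as %.2s should
```

Note that cutting by bytes can split a multibyte character in the
middle; with a precision of 1, for instance, only the first byte of é
would be kept.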

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Peter Stephenson @ 2008-03-04 9:40 UTC
To: zsh-workers

Vincent Lefevre wrote:
> Under UTF-8 locales:
>
> vin:~> zsh-beta -f
> vin% emulate sh
> vin% printf ".%2s.\n" é
> . é.
> vin% /usr/bin/printf ".%2s.\n" é
> .é.
> vin%
>
> As you can see, the zsh printf builtin doesn't behave like the
> coreutils printf, and this is zsh which is wrong. Indeed, the
> precision is the number of bytes, not the number of characters.

That seems to me useless. I can understand in C that a string is a
low-level entity consisting of a set of bytes, but I don't see why a
shell should force the user to count the size of a multibyte character
in the particular locale.

You can fix it by unsetting the MULTIBYTE option.

printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070
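
The wrapper above relies on zsh's MULTIBYTE option. A different
approach, not proposed in the thread and shown here only as a sketch,
is to run printf in a subshell under the C locale, where every byte
counts as one character. The name printf_bytes is hypothetical, and the
assumption is that the shell's printf honours a locale change made
inside the subshell:

```shell
# Hypothetical helper (not from the thread): run printf under the C locale
# in a subshell so that field width and precision are counted in bytes.
# The subshell body "( ... )" keeps the locale change from leaking out.
printf_bytes() (
    LC_ALL=C
    export LC_ALL
    printf "$@"
)

# "é" as its two UTF-8 bytes.
e=$(printf '\303\251')

printf_bytes '.%2s.\n' "$e"         # no padding: é is already 2 bytes wide
printf_bytes '.%.2s.\n' "${e}abc"   # keeps exactly the 2 bytes of é
```

This is the kind of behaviour a fixed-size field in a file format would
need. The trade-off is that the C locale also affects anything else in
the format that is locale-dependent.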

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Vincent Lefevre @ 2008-03-05 0:27 UTC
To: zsh-workers

On 2008-03-04 09:40:07 +0000, Peter Stephenson wrote:
> That seems to me useless.

But that's what POSIX requires (and this hasn't changed in the latest
draft). Also, there may be reasons (e.g. file formats with limited
field sizes). So, zsh should follow the specification, at least when
it emulates sh, since the user may write scripts based on it.

> I can understand in C that a string is a low-level entity consisting
> of a set of bytes, but I don't see why a shell should force the user
> to count the size of a multibyte character in the particular locale.

Well, there could be an extension to give the sizes in characters
instead of bytes.

> You can fix it by unsetting the MULTIBYTE option.
>
> printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }

There's a missing semi-colon:

printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@"; }

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Bart Schaefer @ 2008-03-05 1:34 UTC
To: zsh-workers

On Mar 5, 1:27am, Vincent Lefevre wrote:
} Subject: Re: printf %s in UTF-8 is not POSIX-compliant
}
} > printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
}
} There's a missing semi-colon:

No, there isn't. Zsh doesn't require it, even though bash and ksh do.

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Vincent Lefevre @ 2008-03-06 1:27 UTC
To: zsh-workers

On 2008-03-04 17:34:13 -0800, Bart Schaefer wrote:
> On Mar 5, 1:27am, Vincent Lefevre wrote:
> } Subject: Re: printf %s in UTF-8 is not POSIX-compliant
> }
> } > printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
> }
> } There's a missing semi-colon:
>
> No, there isn't. Zsh doesn't require it, even though bash and ksh do.

Zsh does require it too:

vin:~> zsh -f
vin% emulate sh
vin% printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
function>
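
Both observations are consistent: in native zsh mode a closing brace is
recognised anywhere on the command line, while in sh emulation it is a
reserved word only in command position (most likely because sh
emulation sets IGNORE_BRACES), so "}" after "$@" is read as one more
argument. A small sketch of the form that parses the same way in sh,
bash, ksh and zsh; the function name greet is made up for illustration:

```shell
# Portable form: the list inside { } ends with ";" (or a newline), so the
# following "}" is in command position and closes the function body.
greet() { printf '%s\n' "hello, $1"; }

greet world    # prints "hello, world"
```

Writing the terminating semicolon costs nothing in native zsh mode and
keeps a definition usable when it is pasted into another shell or into
an emulated one.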

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Peter Stephenson @ 2008-03-05 10:41 UTC
To: zsh-workers

Vincent Lefevre wrote:
> On 2008-03-04 09:40:07 +0000, Peter Stephenson wrote:
> > That seems to me useless.
>
> But that's what POSIX requires (and this hasn't changed in the latest
> draft). Also, there may be reasons (e.g. file formats with limited
> field sizes). So, zsh should follow the specification, at least when
> it emulates sh, since the user may write scripts based on it.

There may be something we can do, but at the moment it looks more
complicated than that. Emulations are tied to the behaviour of
interactive shells, so although it's likely you do indeed want
bog-standard byte oriented behaviour if the intention is to run a
script as sh (POSIX mostly deals in the "portable character set",
broadly ASCII so other multibyte effects are irrelevant and best turned
off), it's much less clear that turning off MULTIBYTE for all forms of
sh emulation is useful. In particular, "emulate sh" is the nearest we
have to bash emulation and bash users are likely to expect multibyte
characters to work naturally.

Is it time to introduce a separate "bash" emulation (meaning smart,
interactive shell not necessarily 100% POSIX compatible) and document
that "sh" emulation is aimed at POSIX compatibility? "emulate bash"
already works but is treated the same way as "emulate sh".

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Vincent Lefevre @ 2008-03-06 1:39 UTC
To: zsh-workers

On 2008-03-05 10:41:48 +0000, Peter Stephenson wrote:
> In particular, "emulate sh" is the nearest we have to bash emulation
> and bash users are likely to expect multibyte characters to work
> naturally.

I don't know what you mean by "naturally", but zsh currently behaves
differently from bash in sh emulation:

vin:~> sh
sh-3.1$ printf ".%2s.\n" é
.é.
sh-3.1$ exit
vin:~> zsh -f
vin% emulate sh
vin% printf ".%2s.\n" é
. é.
vin%

And the behavior of bash, when run as sh, will not change. So, I expect
zsh to do the same in sh emulation mode.

Note that bash still outputs .é. (POSIX behavior) when run as bash, but
this may change.

> Is it time to introduce a separate "bash" emulation (meaning smart,
> interactive shell not necessarily 100% POSIX compatible) and
> document that "sh" emulation is aimed at POSIX compatibility?
> "emulate bash" already works but is treated the same way as "emulate sh".

Perhaps it should have the same differences as bash with and without
POSIX mode. I don't know what the best behavior is about the startup
files. From the bash man page:

  [...] When invoked as sh, bash enters posix mode after the startup
  files are read.

  When bash is started in posix mode, as with the --posix command line
  option, it follows the POSIX standard for startup files. [...]

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Peter Stephenson @ 2008-03-06 9:46 UTC
To: zsh-workers

On Thu, 6 Mar 2008 02:39:50 +0100 Vincent Lefevre <vincent@vinc17.org> wrote:
> On 2008-03-05 10:41:48 +0000, Peter Stephenson wrote:
> > In particular, "emulate sh" is the nearest we have to bash emulation
> > and bash users are likely to expect multibyte characters to work
> > naturally.
>
> I don't know what you mean by "naturally", but zsh currently behaves
> differently from bash in sh emulation:

MULTIBYTE has lots of different effects. The point is we either decide
to turn it on or off; I don't see any point in a special option for
this one very minor case. zsh is never going to be completely
compatible with every advanced feature of bash, anyway. So MULTIBYTE
almost certainly needs to be on in that case, but off for POSIX
emulation.

> Perhaps it should have the same differences as bash with and without
> POSIX mode.

I think tracking every possible difference in "sh" and "bash"
emulations, even if they were made separate, would be going way too
far. However, if we had several dozen more people working on the
shell, one of them might have time to look at it.

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Bart Schaefer @ 2008-03-06 17:09 UTC
To: zsh-workers

On Mar 5, 10:41am, Peter Stephenson wrote:
}
} Is it time to introduce a separate "bash" emulation (meaning smart,
} interactive shell not necessarily 100% POSIX compatible) and
} document that "sh" emulation is aimed at POSIX compatibility?

After reading some of the more recent posts on this thread, I've got
an opinion on this.

I think "emulate sh" should emulate the POSIX shell to the greatest
extent possible. If that means turning off MULTIBYTE, turn it off.

(Of course there are still subtle differences between starting the
shell as "sh" and running "emulate sh" after it has started. There
probably isn't any way to entirely resolve that.)

However, if "emulate bash" is going to mean something other than a
synonym for "sh", then some effort should be put into being a bit
closer to bash than it's currently possible to be. For example,
at least set the various BASH_* options, the way "emulate csh" sets
the smattering of CSH_* options.

Of course "emulate bash" isn't even in the documentation at present.
(The "Compatibility" section referenced from the "emulate" command
doesn't discuss csh, either, even though the "emulate" doc does list
csh among the possible arguments.)

A final thought on MULTIBYTE: Is it perhaps reasonable to split this
into two options, one that affects line editor operations and one that
affects internals? If someone does "emulate sh; setopt zle" it seems
there might be some expectation that ZLE can adapt to a terminal that
displays multibyte even if the input is all treated as raw bytes once
accept-line hands it off. That might mean that e.g. _main_complete
needs to look at the state of ZLE_MULTIBYTE (or whatever) and setopt
MULTIBYTE locally to correspond. Other widgets could also be affected,
so the emphasis here is on "reasonable."

(Possible workaround: setopt MULTIBYTE in zle_line_init and unset it
again in preexec.)
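
The workaround in the last paragraph could look roughly like the sketch
below. It assumes the special widget is spelled zle-line-init, as in
current zsh, and that the options set inside the functions are not
restored on return (LOCAL_OPTIONS unset). The function name
enable_multibyte and the variable hooks_installed are made up for
illustration, and the whole thing is untested against the interaction
problems discussed in this thread:

```shell
# Sketch only: takes effect in zsh; other shells skip the block.
hooks_installed=no
if [ -n "${ZSH_VERSION:-}" ]; then
    # Turn MULTIBYTE on while the line editor is reading a line.
    enable_multibyte() { setopt multibyte; }
    # Bind the function to the widget zsh runs when line editing starts.
    zle -N zle-line-init enable_multibyte

    # preexec runs after a line is accepted, just before it is executed:
    # go back to byte-oriented behaviour for the command itself.
    preexec() { unsetopt multibyte; }

    hooks_installed=yes
fi
```

Any other code that defines zle-line-init or preexec would have to be
merged with these definitions rather than overwritten by them.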

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Peter Stephenson @ 2008-03-06 17:45 UTC
To: zsh-workers

On Thu, 06 Mar 2008 09:09:01 -0800 Bart Schaefer <schaefer@brasslantern.com> wrote:
> I think "emulate sh" should emulate the POSIX shell to the greatest
> extent possible. If that means turning off MULTIBYTE, turn it off.

That seems basically sensible.

> However, if "emulate bash" is going to mean something other than a
> synonym for "sh", then some effort should be put into being a bit
> closer to bash than it's currently possible to be. For example,
> at least set the various BASH_* options, the way "emulate csh" sets
> the smattering of CSH_* options.

I'm not sure the first sentence agrees with the second. Are you
suggesting new options?

> A final thought on MULTIBYTE: Is it perhaps reasonable to split this
> into two options, one that affects line editor operations and one that
> affects internals? If someone does "emulate sh; setopt zle" it seems
> there might be some expectation that ZLE can adapt to a terminal that
> displays multibyte even if the input is all treated as raw bytes once
> accept-line hands it off. That might mean that e.g. _main_complete
> needs to look at the state of ZLE_MULTIBYTE (or whatever) and setopt
> MULTIBYTE locally to correspond. Other widgets could also be affected,
> so the emphasis here is on "reasonable."

I think it can be done, and is reasonable if done properly, but is
likely to be bug-prone in the case where one option is on and the other
off. The library code (mostly in utils.c) will need the correct option
passing down to it, widgets (including basic zle widgets) will need to
be careful, and the combination isn't likely to get well-tested anyway.

Index: Doc/Zsh/options.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/options.yo,v
retrieving revision 1.56
diff -u -r1.56 options.yo
--- Doc/Zsh/options.yo	1 Feb 2008 19:59:48 -0000	1.56
+++ Doc/Zsh/options.yo	6 Mar 2008 17:36:57 -0000
@@ -427,10 +427,10 @@
 Append a trailing `tt(/)' to all directory names resulting from filename
 generation (globbing).
 )
-pindex(MULTIBYTE <D>)
+pindex(MULTIBYTE)
 cindex(characters, multibyte, in expansion and globbing)
 cindex(multibyte characters, in expansion and globbing)
-item(tt(MULTIBYTE))(
+item(tt(MULTIBYTE) <C> <K> <Z>)(
 Respect multibyte characters when found in strings.
 When this option is set, strings are examined using the
 system library to determine how many bytes form a character, depending
@@ -438,8 +438,10 @@
 pattern matching, parameter values and various delimiters.
 
 The option is on by default if the shell was compiled with
-tt(MULTIBYTE_SUPPORT); otherwise it is off by default and has no effect if
-turned on.
+tt(MULTIBYTE_SUPPORT) except in tt(sh) emulation; otherwise it is off by
+default and has no effect if turned on.  The mode is off in tt(sh)
+emulation for compatibility but for interative use may need to be
+turned on if the terminal interprets multibyte characters.
 
 If the option is off a single byte is always treated as a single
 character.  This setting is designed purely for examining strings
Index: Src/options.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/options.c,v
retrieving revision 1.38
diff -u -r1.38 options.c
--- Src/options.c	19 Dec 2007 21:49:35 -0000	1.38
+++ Src/options.c	6 Mar 2008 17:36:57 -0000
@@ -173,7 +173,7 @@
 {{NULL, "monitor",	      OPT_SPECIAL},		 MONITOR},
 {{NULL, "multibyte",
 #ifdef MULTIBYTE_SUPPORT
-			      OPT_ALL
+			      OPT_EMULATE|OPT_ZSH|OPT_CSH|OPT_KSH
 #else
 			      0
 #endif

* Re: printf %s in UTF-8 is not POSIX-compliant
From: Bart Schaefer @ 2008-03-07 2:29 UTC
To: zsh-workers

On Mar 6, 5:45pm, Peter Stephenson wrote:
} Subject: Re: printf %s in UTF-8 is not POSIX-compliant
}
} On Thu, 06 Mar 2008 09:09:01 -0800
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > However, if "emulate bash" is going to mean something other than a
} > synonym for "sh", then some effort should be put into being a bit
} > closer to bash than it's currently possible to be. For example,
} > at least set the various BASH_* options, the way "emulate csh" sets
} > the smattering of CSH_* options.
}
} I'm not sure the first sentence agrees with the second. Are you
} suggesting new options?

Well, I considered suggesting that we comprehend bash prompt sequences,
but then decided that was going too far.

What I meant, I guess, was "closer than it's currently possible to get
by running 'emulate'".

} > A final thought on MULTIBYTE: Is it perhaps reasonable to split this
} > into two options, one that affects line editor operations and one that
} > affects internals?
}
} I think it can be done, and is reasonable if done properly, but is
} likely to be bug-prone in the case where one option is on and the
} other off.

Yes, that's exactly the issue. Regardless of the size or complexity of
the job to alter the C code, it's unreasonable if the result introduces
more script bugs than it fixes incompatibilities.