zsh-workers
* printf %s in UTF-8 is not POSIX-compliant
@ 2008-03-04  1:29 Vincent Lefevre
  2008-03-04  1:37 ` Vincent Lefevre
  2008-03-04  9:40 ` Peter Stephenson
  0 siblings, 2 replies; 12+ messages in thread
From: Vincent Lefevre @ 2008-03-04  1:29 UTC
  To: zsh-workers

Hi,

Under UTF-8 locales:

vin:~> zsh-beta -f
vin% emulate sh
vin% printf ".%2s.\n" é
. é.
vin% /usr/bin/printf ".%2s.\n" é 
.é.
vin%

As you can see, the zsh printf builtin doesn't behave like the
coreutils printf, and it is zsh that is wrong. Indeed, the
precision is the number of bytes, not the number of characters.

http://www.opengroup.org/onlinepubs/009695399/utilities/printf.html

says (in the extended description) that the "file format notation"
shall be used for the format (and %s isn't an exception).

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap05.html

(file format notation) says:

  s
    The argument shall be taken to be a string and bytes from the
    string shall be written until the end of the string or the number
    of bytes indicated by the precision specification of the argument
    is reached. If the precision is omitted from the argument, it
    shall be taken to be infinite, so all bytes up to the end of the
    string shall be written.
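
As an editorial illustration of why bytes and characters diverge here,
the following minimal sketch counts the bytes in the test strings. The
helper name byte_len is hypothetical, and é is spelled with octal
escapes on the assumption that the shell's printf accepts \ddd in the
format (POSIX requires it), so the result does not depend on how this
text is encoded.

```shell
# "é" (U+00E9) written as its two UTF-8 bytes, 0xC3 0xA9.
e_acute=$(printf '\303\251')

# byte_len STRING: print the number of bytes in STRING.
# The arithmetic expansion strips the padding some wc versions emit.
byte_len() {
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

byte_len "$e_acute"        # prints 2
byte_len "${e_acute}abc"   # prints 5
```

Under a byte-counting rule, é alone already fills a field of width 2,
which is consistent with the unpadded .é. printed by /usr/bin/printf.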

Note: ksh93 has the same bug, but pdksh and bash do not. However, bash
may change its behavior when not in POSIX mode; see

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=459413

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-04  1:29 printf %s in UTF-8 is not POSIX-compliant Vincent Lefevre
@ 2008-03-04  1:37 ` Vincent Lefevre
  2008-03-04  9:40 ` Peter Stephenson
  1 sibling, 0 replies; 12+ messages in thread
From: Vincent Lefevre @ 2008-03-04  1:37 UTC
  To: zsh-workers

I mixed up the field width and the precision, but there's the same
problem:

vin% emulate sh
vin% printf ".%.2s.\n" éabc      
.éa.
vin% /usr/bin/printf ".%.2s.\n" éabc
.é.
vin%

and POSIX says:

  field width
    An optional string of decimal digits to specify a minimum field
    width. For an output field, if the converted value has fewer
    bytes than the field width, [...]
    ^^^^^
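
To make the byte-oriented reading of the precision concrete without
relying on any particular printf implementation, here is a small
sketch that truncates at the byte level with dd. It is an editorial
aside, not something from the original message; first_bytes is a
hypothetical helper name.

```shell
# "é" (U+00E9) as its two UTF-8 bytes, independent of file encoding.
e_acute=$(printf '\303\251')

# first_bytes N STRING: write at most the first N bytes of STRING.
first_bytes() {
    printf '%s' "$2" | dd bs=1 count="$1" 2>/dev/null
}

first_bytes 2 "${e_acute}abc"   # the two bytes of é, as with %.2s in bytes
first_bytes 3 "${e_acute}abc"   # é followed by a
```

A side effect of counting bytes is that a precision of 1 would split é
and emit a lone 0xC3 byte, which is not valid UTF-8 on its own.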

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-04  1:29 printf %s in UTF-8 is not POSIX-compliant Vincent Lefevre
  2008-03-04  1:37 ` Vincent Lefevre
@ 2008-03-04  9:40 ` Peter Stephenson
  2008-03-05  0:27   ` Vincent Lefevre
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Stephenson @ 2008-03-04  9:40 UTC
  To: zsh-workers

Vincent Lefevre wrote:
> Under UTF-8 locales:
> 
> vin:~> zsh-beta -f
> vin% emulate sh
> vin% printf ".%2s.\n" é
> . é.
> vin% /usr/bin/printf ".%2s.\n" é 
> .é.
> vin%
> 
> As you can see, the zsh printf builtin doesn't behave like the
> coreutils printf, and it is zsh that is wrong. Indeed, the
> precision is the number of bytes, not the number of characters.

That seems to me useless.  I can understand in C that a string is a
low-level entity consisting of a set of bytes, but I don't see why a
shell should force the user to count the size of a multibyte character
in the particular locale.

You can fix it by unsetting the MULTIBYTE option.

printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-04  9:40 ` Peter Stephenson
@ 2008-03-05  0:27   ` Vincent Lefevre
  2008-03-05  1:34     ` Bart Schaefer
  2008-03-05 10:41     ` Peter Stephenson
  0 siblings, 2 replies; 12+ messages in thread
From: Vincent Lefevre @ 2008-03-05  0:27 UTC
  To: zsh-workers

On 2008-03-04 09:40:07 +0000, Peter Stephenson wrote:
> That seems to me useless.

But that's what POSIX requires (and this hasn't changed in the latest
draft). Also, there may be reasons (e.g. file formats with limited
field sizes). So, zsh should follow the specification, at least when
it emulates sh, since the user may write scripts based on it.

> I can understand in C that a string is a low-level entity consisting
> of a set of bytes, but I don't see why a shell should force the user
> to count the size of a multibyte character in the particular locale.

Well, there could be an extension to give the sizes in characters
instead of bytes.

> You can fix it by unsetting the MULTIBYTE option.
> 
> printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }

There's a missing semi-colon:

printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@"; }

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-05  0:27   ` Vincent Lefevre
@ 2008-03-05  1:34     ` Bart Schaefer
  2008-03-06  1:27       ` Vincent Lefevre
  2008-03-05 10:41     ` Peter Stephenson
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Schaefer @ 2008-03-05  1:34 UTC
  To: zsh-workers

On Mar 5,  1:27am, Vincent Lefevre wrote:
} Subject: Re: printf %s in UTF-8 is not POSIX-compliant
}
} > printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
} 
} There's a missing semi-colon:

No, there isn't.  Zsh doesn't require it, even though bash and ksh do.



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-05  0:27   ` Vincent Lefevre
  2008-03-05  1:34     ` Bart Schaefer
@ 2008-03-05 10:41     ` Peter Stephenson
  2008-03-06  1:39       ` Vincent Lefevre
  2008-03-06 17:09       ` Bart Schaefer
  1 sibling, 2 replies; 12+ messages in thread
From: Peter Stephenson @ 2008-03-05 10:41 UTC
  To: zsh-workers

Vincent Lefevre wrote:
> On 2008-03-04 09:40:07 +0000, Peter Stephenson wrote:
> > That seems to me useless.
> 
> But that's what POSIX requires (and this hasn't changed in the latest
> draft). Also, there may be reasons (e.g. file formats with limited
> field sizes). So, zsh should follow the specification, at least when
> it emulates sh, since the user may write scripts based on it.

There may be something we can do, but at the moment it looks more
complicated than that.  Emulations are tied to the behaviour of
interactive shells, so although it's likely you do indeed want
bog-standard byte oriented behaviour if the intention is to run a script
as sh (POSIX mostly deals in the "portable character set", broadly ASCII,
so other multibyte effects are irrelevant and best turned off), it's
much less clear that turning off MULTIBYTE for all forms of sh emulation
is useful.  In particular, "emulate sh" is the nearest we have to bash
emulation and bash users are likely to expect multibyte characters to
work naturally.

Is it time to introduce a separate "bash" emulation (meaning smart,
interactive shell not necessarily 100% POSIX compatible) and
document that "sh" emulation is aimed at POSIX compatibility?
"emulate bash" already works but is treated the same way as "emulate sh".

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-05  1:34     ` Bart Schaefer
@ 2008-03-06  1:27       ` Vincent Lefevre
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Lefevre @ 2008-03-06  1:27 UTC
  To: zsh-workers

On 2008-03-04 17:34:13 -0800, Bart Schaefer wrote:
> On Mar 5,  1:27am, Vincent Lefevre wrote:
> } Subject: Re: printf %s in UTF-8 is not POSIX-compliant
> }
> } > printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
> } 
> } There's a missing semi-colon:
> 
> No, there isn't.  Zsh doesn't require it, even though bash and ksh do.

Zsh does require it too:

vin:~> zsh -f
vin% emulate sh
vin% printf() { emulate -L zsh; unsetopt multibyte; builtin printf "$@" }
function>
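
The continuation prompt above shows that, under sh emulation, the
closing brace is not recognized unless it starts a new command. A
plausible explanation, offered here as an assumption rather than
something stated in the thread, is that sh emulation sets the
IGNORE_BRACES option. Either way, a one-line definition that parses in
sh, bash, ksh and zsh alike ends its last command with a semicolon.
The function show_args below is a hypothetical example, not the
wrapper from the thread.

```shell
# Portable one-line definition: the ';' before '}' puts the closing
# brace in command position, where every Bourne-style shell sees it.
show_args() { printf '<%s>' "$@"; printf '\n'; }

show_args a "b c"   # prints <a><b c>
```

Writing the body across several lines avoids the question, since a
newline terminates the last command just as a semicolon does.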

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-05 10:41     ` Peter Stephenson
@ 2008-03-06  1:39       ` Vincent Lefevre
  2008-03-06  9:46         ` Peter Stephenson
  2008-03-06 17:09       ` Bart Schaefer
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Lefevre @ 2008-03-06  1:39 UTC
  To: zsh-workers

On 2008-03-05 10:41:48 +0000, Peter Stephenson wrote:
> In particular, "emulate sh" is the nearest we have to bash emulation
> and bash users are likely to expect multibyte characters to work
> naturally.

I don't know what you mean by "naturally", but zsh currently behaves
differently from bash in sh emulation:

vin:~> sh
sh-3.1$ printf ".%2s.\n" é
.é.
sh-3.1$ exit
vin:~> zsh -f
vin% emulate sh
vin% printf ".%2s.\n" é
. é.
vin% 

And the behavior of bash, when run as sh, will not change. So,
I expect zsh to do the same in sh emulation mode.

Note that bash still outputs .é. (POSIX behavior) when run as bash,
but this may change.

> Is it time to introduce a separate "bash" emulation (meaning smart,
> interactive shell not necessarily 100% POSIX compatible) and
> document that "sh" emulation is aimed at POSIX compatibility?
> "emulate bash" already works but is treated the same way as "emulate sh".

Perhaps it should have the same differences as bash with and without
POSIX mode. I don't know what the best behavior is for the startup
files. From the bash man page:

  [...] When invoked as sh, bash enters posix mode after the startup
  files are read.

  When bash is started in posix mode, as with the --posix command line
  option, it follows the POSIX standard for startup files. [...]

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-06  1:39       ` Vincent Lefevre
@ 2008-03-06  9:46         ` Peter Stephenson
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Stephenson @ 2008-03-06  9:46 UTC
  To: zsh-workers

On Thu, 6 Mar 2008 02:39:50 +0100
Vincent Lefevre <vincent@vinc17.org> wrote:
> On 2008-03-05 10:41:48 +0000, Peter Stephenson wrote:
> > In particular, "emulate sh" is the nearest we have to bash emulation
> > and bash users are likely to expect multibyte characters to work
> > naturally.
> 
> I don't know what you mean by "naturally", but zsh currently behaves
> differently from bash in sh emulation:

MULTIBYTE has lots of different effects.  The point is we either decide to
turn it on or off; I don't see any point in a special option for this one
very minor case.  zsh is never going to be completely compatible with every
advanced feature of bash, anyway.  So MULTIBYTE almost certainly needs to
be on for bash emulation, but off for POSIX emulation.

> Perhaps it should have the same differences as bash with and without
> POSIX mode.

I think tracking every possible difference in "sh" and "bash" emulations,
even if they were made separate, would be going way too far.  However, if
we had several dozen more people working on the shell, one of them might
have time to look at it.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-05 10:41     ` Peter Stephenson
  2008-03-06  1:39       ` Vincent Lefevre
@ 2008-03-06 17:09       ` Bart Schaefer
  2008-03-06 17:45         ` Peter Stephenson
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Schaefer @ 2008-03-06 17:09 UTC
  To: zsh-workers

On Mar 5, 10:41am, Peter Stephenson wrote:
}
} Is it time to introduce a separate "bash" emulation (meaning smart,
} interactive shell not necessarily 100% POSIX compatible) and
} document that "sh" emulation is aimed at POSIX compatibility?

After reading some of the more recent posts on this thread, I've got
an opinion on this.

I think "emulate sh" should emulate the POSIX shell to the greatest
extent possible.  If that means turning off MULTIBYTE, turn it off.
(Of course there are still subtle differences between starting the
shell as "sh" and running "emulate sh" after it has started.  There
probably isn't any way to entirely resolve that.)

However, if "emulate bash" is going to mean something other than a
synonym for "sh", then some effort should be put into being a bit
closer to bash than it's currently possible to be.  For example,
at least set the various BASH_* options, the way "emulate csh" sets
the smattering of CSH_* options.

Of course "emulate bash" isn't even in the documentation at present.
(The "Compatibility" section referenced from the "emulate" command
doesn't discuss csh, either, even though the "emulate" doc does list
csh among the possible arguments.)

A final thought on MULTIBYTE:  Is it perhaps reasonable to split this
into two options, one that affects line editor operations and one that
affects internals?  If someone does "emulate sh; setopt zle" it seems
there might be some expectation that ZLE can adapt to a terminal that
displays multibyte even if the input is all treated as raw bytes once
accept-line hands it off.  That might mean that e.g. _main_complete
needs to look at the state of ZLE_MULTIBYTE (or whatever) and setopt
MULTIBYTE locally to correspond.  Other widgets could also be affected,
so the emphasis here is on "reasonable."

(Possible workaround:  setopt MULTIBYTE in zle_line_init and unset it
again in preexec.)



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-06 17:09       ` Bart Schaefer
@ 2008-03-06 17:45         ` Peter Stephenson
  2008-03-07  2:29           ` Bart Schaefer
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Stephenson @ 2008-03-06 17:45 UTC
  To: zsh-workers

On Thu, 06 Mar 2008 09:09:01 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> I think "emulate sh" should emulate the POSIX shell to the greatest
> extent possible.  If that means turning off MULTIBYTE, turn it off.

That seems basically sensible.

> However, if "emulate bash" is going to mean something other than a
> synonym for "sh", then some effort should be put into being a bit
> closer to bash than it's currently possible to be.  For example,
> at least set the various BASH_* options, the way "emulate csh" sets
> the smattering of CSH_* options.

I'm not sure the first sentence agrees with the second.  Are you suggesting
new options?

> A final thought on MULTIBYTE:  Is it perhaps reasonable to split this
> into two options, one that affects line editor operations and one that
> affects internals?  If someone does "emulate sh; setopt zle" it seems
> there might be some expectation that ZLE can adapt to a terminal that
> displays multibyte even if the input is all treated as raw bytes once
> accept-line hands it off.  That might mean that e.g. _main_complete
> needs to look at the state of ZLE_MULTIBYTE (or whatever) and setopt
> MULTIBYTE locally to correspond.  Other widgets could also be affected,
> so the emphasis here is on "reasonable."

I think it can be done, and is reasonable if done properly, but is likely
to be bug-prone in the case where one option is on and the other off.  The
library code (mostly in utils.c) will need the correct option passing down
to it, widgets (including basic zle widgets) will need to be careful, and
the combination isn't likely to get well-tested anyway.

Index: Doc/Zsh/options.yo
===================================================================
RCS file: /cvsroot/zsh/zsh/Doc/Zsh/options.yo,v
retrieving revision 1.56
diff -u -r1.56 options.yo
--- Doc/Zsh/options.yo	1 Feb 2008 19:59:48 -0000	1.56
+++ Doc/Zsh/options.yo	6 Mar 2008 17:36:57 -0000
@@ -427,10 +427,10 @@
 Append a trailing `tt(/)' to all directory
 names resulting from filename generation (globbing).
 )
-pindex(MULTIBYTE <D>)
+pindex(MULTIBYTE)
 cindex(characters, multibyte, in expansion and globbing)
 cindex(multibyte characters, in expansion and globbing)
-item(tt(MULTIBYTE))(
+item(tt(MULTIBYTE) <C> <K> <Z>)(
 Respect multibyte characters when found in strings.
 When this option is set, strings are examined using the
 system library to determine how many bytes form a character, depending
@@ -438,8 +438,10 @@
 pattern matching, parameter values and various delimiters.
 
 The option is on by default if the shell was compiled with
-tt(MULTIBYTE_SUPPORT); otherwise it is off by default and has no effect if
-turned on.
+tt(MULTIBYTE_SUPPORT) except in tt(sh) emulation; otherwise it is off by
+default and has no effect if turned on.  The mode is off in tt(sh)
+emulation for compatibility but for interactive use may need to be
+turned on if the terminal interprets multibyte characters.
 
 If the option is off a single byte is always treated as a single
 character.  This setting is designed purely for examining strings
Index: Src/options.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/options.c,v
retrieving revision 1.38
diff -u -r1.38 options.c
--- Src/options.c	19 Dec 2007 21:49:35 -0000	1.38
+++ Src/options.c	6 Mar 2008 17:36:57 -0000
@@ -173,7 +173,7 @@
 {{NULL, "monitor",	      OPT_SPECIAL},		 MONITOR},
 {{NULL, "multibyte",
 #ifdef MULTIBYTE_SUPPORT
-			      OPT_ALL
+			      OPT_EMULATE|OPT_ZSH|OPT_CSH|OPT_KSH
 #else
 			      0
 #endif

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070



* Re: printf %s in UTF-8 is not POSIX-compliant
  2008-03-06 17:45         ` Peter Stephenson
@ 2008-03-07  2:29           ` Bart Schaefer
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Schaefer @ 2008-03-07  2:29 UTC
  To: zsh-workers

On Mar 6,  5:45pm, Peter Stephenson wrote:
} Subject: Re: printf %s in UTF-8 is not POSIX-compliant
}
} On Thu, 06 Mar 2008 09:09:01 -0800
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > However, if "emulate bash" is going to mean something other than a
} > synonym for "sh", then some effort should be put into being a bit
} > closer to bash than it's currently possible to be.  For example,
} > at least set the various BASH_* options, the way "emulate csh" sets
} > the smattering of CSH_* options.
} 
} I'm not sure the first sentence agrees with the second. Are you
} suggesting new options?

Well, I considered suggesting that we comprehend bash prompt sequences,
but then decided that was going too far.  What I meant, I guess, was
"closer than it's currently possible to get by running 'emulate'".

} > A final thought on MULTIBYTE:  Is it perhaps reasonable to split this
} > into two options, one that affects line editor operations and one that
} > affects internals?
} 
} I think it can be done, and is reasonable if done properly, but is
} likely to be bug-prone in the case where one option is on and the
} other off.

Yes, that's exactly the issue.  Regardless of the size or complexity of
the job to alter the C code, it's unreasonable if the result introduces
more script bugs than it fixes incompatibilities.

