printf %<n>s in UTF-8 is not always POSIX-compliant

* printf %<n>s in UTF-8 is not always POSIX-compliant
@ 2012-02-15  2:15 Vincent Lefevre
  2012-02-15  8:14 ` Bart Schaefer
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Lefevre @ 2012-02-15  2:15 UTC
  To: zsh-workers

Hi,

I've reported the following bug:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659932

In UTF-8 locales:

xvii% printf ".%2s.\n" é
. é.
xvii% emulate sh
xvii% printf ".%2s.\n" é
.é.
xvii% emulate ksh
xvii% printf ".%2s.\n" é
. é.

It is correct in sh mode (according to POSIX[*]), but not in ksh mode,
which should also follow the POSIX behavior. What about zsh mode?

[*] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
and
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html#tag_05
for %<n>s.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
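
As a rough sketch of the byte-counting rule described in the message above: é is encoded in UTF-8 as the two bytes 0xC3 0xA9, so under a byte-based reading a field of width 2 is already full and no padding space is added. The helper name pad_bytes below is hypothetical and not part of any shell; it only imitates the padding step in portable sh, so that the result does not depend on how a given printf builtin counts.

```shell
# Pad a string on the left to a minimum width counted in BYTES,
# imitating the POSIX reading of printf "%<n>s".
# pad_bytes is a hypothetical helper, not part of any shell.
pad_bytes() {
    width=$1
    str=$2
    # wc -c counts bytes; tr strips the spaces that some wc
    # implementations print around the number.
    len=$(printf '%s' "$str" | wc -c | tr -d ' ')
    pad=''
    while [ "$len" -lt "$width" ]; do
        pad="$pad "
        len=$((len + 1))
    done
    printf '%s%s' "$pad" "$str"
}

e_acute=$(printf '\303\251')   # é encoded in UTF-8: bytes 0xC3 0xA9

# Two bytes already fill a two-byte field, so no space is added.
printf '.%s.\n' "$(pad_bytes 2 "$e_acute")"
# A three-byte field leaves room for exactly one space.
printf '.%s.\n' "$(pad_bytes 3 "$e_acute")"
```

The first call corresponds to the ".é." output reported for sh emulation; a character-based count would instead see one character and add a space, as in the ". é." output.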


* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15  2:15 printf %<n>s in UTF-8 is not always POSIX-compliant Vincent Lefevre
@ 2012-02-15  8:14 ` Bart Schaefer
  2012-02-15  9:10   ` Vincent Lefevre
                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Bart Schaefer @ 2012-02-15  8:14 UTC
  To: zsh-workers

On Feb 15,  3:15am, Vincent Lefevre wrote:
}
} In UTF-8 locales:
} 
} xvii% printf ".%2s.\n" é
} .é.

Am I understanding correctly that the intent here is that é is a two-
byte character so %2s should print the two literal bytes, rather than
print the single logical character in a field two logical characters
wide?

The reason it's different for "emulate sh" is that sh emulation turns
off all support for multibyte characters (unsetopt multibyte).  If you
were to do
	emulate sh -c 'setopt multibyte; printf ".%2s.\n" é'
then I believe you'd see the same behavior as with "emulate ksh".

As to whether it's correct ... I think I'd prefer the logical rather
than literal interpretation, but it'll be difficult [or a hack that
requires looking at the global emulation state, so it won't be possible
to reproduce it with plain setopts] to turn off multibyte processing in
printf for ksh emulation but not native zsh.

* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15  8:14 ` Bart Schaefer
@ 2012-02-15  9:10   ` Vincent Lefevre
  2012-02-15 11:05   ` Peter Stephenson
  2012-02-15 14:42   ` Oliver Kiddle
  2 siblings, 0 replies; 10+ messages in thread
From: Vincent Lefevre @ 2012-02-15  9:10 UTC
  To: zsh-workers

On 2012-02-15 00:14:12 -0800, Bart Schaefer wrote:
> On Feb 15,  3:15am, Vincent Lefevre wrote:
> }
> } In UTF-8 locales:
> } 
> } xvii% printf ".%2s.\n" é
> } .é.
> 
> Am I understanding correctly that the intent here is that é is a two-
> byte character so %2s should print the two literal bytes, rather than
> print the single logical character in a field two logical characters
> wide?

Yes, the number is the size in bytes, not in characters. I think
that the intent is to deal with internal structures (e.g. with
file formats where some fields have a fixed or limited size, and
the same syntax can be used in C to avoid buffer overflows).
Note that there's the same problem with:

xvii% printf ".%.3s.\n" éabcd
.éab.
xvii% emulate ksh
xvii% printf ".%.3s.\n" éabcd
.éab.
xvii% emulate sh
xvii% printf ".%.3s.\n" éabcd
.éa.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
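
The precision case in the message above can be sketched the same way. Assuming the byte-based reading, "%.3s" keeps the first three bytes of "éabcd", which is é (two bytes) followed by "a". The helper name trunc_bytes is hypothetical; it uses dd only to cut at a byte boundary independently of the shell's own printf.

```shell
# Keep at most the first N BYTES of a string, imitating the POSIX
# reading of printf "%.<n>s". trunc_bytes is a hypothetical helper.
trunc_bytes() {
    # With bs=1, dd copies at most "count" single-byte blocks.
    printf '%s' "$2" | dd bs=1 count="$1" 2>/dev/null
}

s=$(printf '\303\251abcd')   # "éabcd" in UTF-8: é is bytes 0xC3 0xA9

# Three bytes cover é (two bytes) plus "a".
printf '.%s.\n' "$(trunc_bytes 3 "$s")"
```

Note that a byte-based cut can land in the middle of a multibyte character: a precision of 1 here would keep only the first byte of é, which is not valid UTF-8 on its own.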

* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15  8:14 ` Bart Schaefer
  2012-02-15  9:10   ` Vincent Lefevre
@ 2012-02-15 11:05   ` Peter Stephenson
  2012-02-15 11:53     ` Vincent Lefevre
  2012-02-15 14:42   ` Oliver Kiddle
  2 siblings, 1 reply; 10+ messages in thread
From: Peter Stephenson @ 2012-02-15 11:05 UTC
  To: zsh-workers

On Wed, 15 Feb 2012 00:14:12 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> The reason it's different for "emulate sh" is that sh emulation turns
> off all support for multibyte characters (unsetopt multibyte).  If you
> were to do
> 	emulate sh -c 'setopt multibyte; printf ".%2s.\n" é'
> then I believe you'd see the same behavior as with "emulate ksh".
> 
> As to whether it's correct ... I think I'd prefer the logical rather
> than literal interpretation, but it'll be difficult [or a hack that
> requires looking at the global emulation state, so it won't be possible
> to reproduce it with plain setopts] to turn off multibyte processing in
> printf for ksh emulation but not native zsh.

This sounds correct... We've never promised ksh mode would be a complete
representation of ksh anyway.  I realise that, for historical reasons
related to standards rather than zsh, you'd expect ksh mode to be POSIX
compatible, but actually we don't tend to bother because ksh mode isn't
that widely used and so doesn't get a lot of attention (I certainly
never use it).  If you really want compatibility native zsh mode or sh
mode are the sensible choices.

So probably the fix is to spread fear, uncertainty and doubt about ksh
mode.  I'll start right now.

If there is a hard-core ksh mode user who'd like to maintain it, of
course, that's another story.

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
More information can be found at www.csr.com. Follow CSR on Twitter at http://twitter.com/CSR_PLC and read our blog at www.csr.com/blog

* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15 11:05   ` Peter Stephenson
@ 2012-02-15 11:53     ` Vincent Lefevre
  2012-02-15 12:09       ` Frank Terbeck
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Lefevre @ 2012-02-15 11:53 UTC
  To: zsh-workers

On 2012-02-15 11:05:19 +0000, Peter Stephenson wrote:
> This sounds correct... We've never promised ksh mode would be a complete
> representation of ksh anyway.  I realise that, for historical reasons
> related to standards rather than zsh, you'd expect ksh mode to be POSIX
> compatible, but actually we don't tend to bother because ksh mode isn't
> that widely used and so doesn't get a lot of attention (I certainly
> never use it).  If you really want compatibility native zsh mode or sh
> mode are the sensible choices.

The problem is that on some machines, one has a symlink ksh -> zsh.
If I type ksh or run a script with #!/usr/bin/ksh, I expect this to
behave as a real ksh.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15 11:53     ` Vincent Lefevre
@ 2012-02-15 12:09       ` Frank Terbeck
  2012-02-15 12:23         ` Peter Stephenson
  2012-02-15 12:42         ` Vincent Lefevre
  0 siblings, 2 replies; 10+ messages in thread
From: Frank Terbeck @ 2012-02-15 12:09 UTC
  To: zsh-workers

Vincent Lefevre wrote:
> On 2012-02-15 11:05:19 +0000, Peter Stephenson wrote:
>> This sounds correct... We've never promised ksh mode would be a complete
>> representation of ksh anyway.
[...]
> The problem is that on some machines, one has a symlink ksh -> zsh.
> If I type ksh or run a script with #!/usr/bin/ksh, I expect this to
> behave as a real ksh.

Frankly, that would be the vendor's fault then. There are many *MANY*
ksh implementations, that make for a reasonable link target (ksh93,
pdksh or mksh - to name just a few). Zsh is not one of them.

IMHO, ksh-emulation is a little bit like csh emulation: It's meant to
make users with ksh background feel more "at home", not as a strict
bug-for-bug emulation.

Regards, Frank

* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15 12:09       ` Frank Terbeck
@ 2012-02-15 12:23         ` Peter Stephenson
  2012-02-15 12:42         ` Vincent Lefevre
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Stephenson @ 2012-02-15 12:23 UTC
  To: zsh-workers

On Wed, 15 Feb 2012 13:09:17 +0100
Frank Terbeck <ft@bewatermyfriend.org> wrote:
> IMHO, ksh-emulation is a little bit like csh emulation: It's meant to
> make users with ksh background feel more "at home", not as a strict
> bug-for-bug emulation.

That's basically how I see it.  It doesn't mean we can't do better --- but
I don't think we can do better by people who don't really use the mode
initiating random tweaks in the hope that the world becomes a better
place.  We really would need someone who is in a position to take a
global view of how changes to the mode affect the emulation.

This is a rather different case from POSIX emulation, where there's (i)
a standard (ii) quite a lot of visibility of what the effect of changes
are.

-- 
Peter Stephenson <pws@csr.com>            Software Engineer
Tel: +44 (0)1223 692070                   Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, UK


* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15 12:09       ` Frank Terbeck
  2012-02-15 12:23         ` Peter Stephenson
@ 2012-02-15 12:42         ` Vincent Lefevre
  1 sibling, 0 replies; 10+ messages in thread
From: Vincent Lefevre @ 2012-02-15 12:42 UTC
  To: zsh-workers

On 2012-02-15 13:09:17 +0100, Frank Terbeck wrote:
> Frankly, that would be the vendor's fault then. There are many *MANY*
> ksh implementations, that make for a reasonable link target (ksh93,
> pdksh or mksh - to name just a few). Zsh is not one of them.

OK, bug reported.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15  8:14 ` Bart Schaefer
  2012-02-15  9:10   ` Vincent Lefevre
  2012-02-15 11:05   ` Peter Stephenson
@ 2012-02-15 14:42   ` Oliver Kiddle
  2012-02-15 14:56     ` Vincent Lefevre
  2 siblings, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2012-02-15 14:42 UTC
  To: zsh-workers

Bart wrote:
> 
> Am I understanding correctly that the intent here is that é is a two-
> byte character so %2s should print the two literal bytes, rather than
> print the single logical character in a field two logical characters
> wide?

That's correct. The POSIX definition uses bytes. For multibyte
behaviour, there is an L modifier. I don't really see the sense in it
myself: I don't want to write low-level stuff in the shell.

Frank Terbeck wrote:
> Frankly, that would be the vendor's fault then. There are many *MANY*
> ksh implementations, that make for a reasonable link target (ksh93,
> pdksh or mksh - to name just a few). Zsh is not one of them.

The fact that zsh is far from a perfect emulation doesn't stop it from
being useful. I don't necessarily want to install a separate ksh package
and zsh will run ksh scripts at least as well as pdksh.

Oliver


* Re: printf %<n>s in UTF-8 is not always POSIX-compliant
  2012-02-15 14:42   ` Oliver Kiddle
@ 2012-02-15 14:56     ` Vincent Lefevre
  0 siblings, 0 replies; 10+ messages in thread
From: Vincent Lefevre @ 2012-02-15 14:56 UTC
  To: zsh-workers

On 2012-02-15 15:42:15 +0100, Oliver Kiddle wrote:
> Bart wrote:
> > Am I understanding correctly that the intent here is that é is a two-
> > byte character so %2s should print the two literal bytes, rather than
> > print the single logical character in a field two logical characters
> > wide?
> 
> That's correct. The POSIX definition uses bytes. For multibyte
> behaviour, there is an L modifier. I don't really see the sense in it
> myself: I don't want to write low-level stuff in the shell.

I think that's for consistency with C. Also, the shell could then
be used as a front-end to test string-related things.

> Frank Terbeck wrote:
> > Frankly, that would be the vendor's fault then. There are many *MANY*
> > ksh implementations, that make for a reasonable link target (ksh93,
> > pdksh or mksh - to name just a few). Zsh is not one of them.
> 
> The fact that zsh is far from a perfect emulation doesn't stop it from
> being useful. I don't necessarily want to install a separate ksh package
> and zsh will run ksh scripts at least as well as pdksh.

But then the emulation should be correct.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

