zsh-workers
* In POSIX mode, ${#var} measures length in bytes, not characters
@ 2015-06-07  0:23 Martijn Dekker
  2015-06-07  0:34 ` ZyX
  2015-06-08  8:44 ` Peter Stephenson
  0 siblings, 2 replies; 13+ messages in thread
From: Martijn Dekker @ 2015-06-07  0:23 UTC (permalink / raw)
  To: zsh-workers

When in 'emulate sh' mode, ${#var} substitutes the length of the
variable in bytes, not characters. This is contrary to the standard; the
length in characters is supposed to be substituted.[*]

Oddly enough, zsh is POSIX compliant here in native mode, but
non-compliant in POSIX mode.

Confirmed in zsh 4.3.11 (Mac OS X), 5.0.2 (Linux) and 5.0.8 (Mac OS X).

$ zsh
% locale
LANG="nl_NL.UTF-8"
LC_COLLATE="nl_NL.UTF-8"
LC_CTYPE="nl_NL.UTF-8"
LC_MESSAGES="nl_NL.UTF-8"
LC_MONETARY="nl_NL.UTF-8"
LC_NUMERIC="nl_NL.UTF-8"
LC_TIME="nl_NL.UTF-8"
LC_ALL=
% mot=arrêté
% echo ${#mot}
6
% emulate sh
% echo ${#mot}
8

- Martijn

[*] Reference:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_02
> ${#parameter}
>     String Length. The length in characters of the value of parameter
>     shall be substituted. [...]
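
The byte/character difference reported above can be reproduced without relying on any shell's ${#var}. This is only a minimal sketch: byte_len and char_len are hypothetical helper names, and char_len assumes the value is valid UTF-8 and counts code points (a combining accent would count as its own character).

```shell
# Sketch only: byte_len and char_len are hypothetical helper names.
# The test string is built from octal escapes so the example does not
# depend on how this text itself is encoded.
mot=$(printf 'arr\303\252t\303\251')   # "arrêté" encoded as UTF-8

# Number of bytes: wc -c always counts bytes.
byte_len() {
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Number of UTF-8 code points: delete continuation bytes (0x80-0xBF)
# and count what is left. Assumes valid UTF-8 input.
char_len() {
    echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

byte_len "$mot"   # prints 8
char_len "$mot"   # prints 6
```

The two results match the 8 and 6 seen in the transcript: the two accented letters each take two bytes in UTF-8.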



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  0:23 In POSIX mode, ${#var} measures length in bytes, not characters Martijn Dekker
@ 2015-06-07  0:34 ` ZyX
  2015-06-07  1:23   ` Bart Schaefer
  2015-06-07  2:21   ` Martijn Dekker
  2015-06-08  8:44 ` Peter Stephenson
  1 sibling, 2 replies; 13+ messages in thread
From: ZyX @ 2015-06-07  0:34 UTC (permalink / raw)
  To: Martijn Dekker, zsh-workers

07.06.2015, 03:29, "Martijn Dekker" <martijn@inlv.org>:
> When in 'emulate sh' mode, ${#var} substitutes the length of the
> variable in bytes, not characters. This is contrary to the standard; the
> length in characters is supposed to be substituted.[*]
>
> Oddly enough, zsh is POSIX compliant here in native mode, but
> non-compliant in POSIX mode.

Do you have a reference where “character” is defined? This behaviour is the same in posh and dash:

    % posh -c 'VAR="«»"; echo ${#VAR}'
    4
    % dash -c 'VAR="«»"; echo ${#VAR}'
    4
    % zsh -c 'VAR="«»"; echo ${#VAR}' # Non-POSIX mode: length in Unicode codepoints for comparison
    2
    % locale
    LANG=ru_RU.UTF-8
    LC_CTYPE="ru_RU.UTF-8"
    LC_NUMERIC="ru_RU.UTF-8"
    LC_TIME="ru_RU.UTF-8"
    LC_COLLATE="ru_RU.UTF-8"
    LC_MONETARY="ru_RU.UTF-8"
    LC_MESSAGES="ru_RU.UTF-8"
    LC_PAPER="ru_RU.UTF-8"
    LC_NAME="ru_RU.UTF-8"
    LC_ADDRESS="ru_RU.UTF-8"
    LC_TELEPHONE="ru_RU.UTF-8"
    LC_MEASUREMENT="ru_RU.UTF-8"
    LC_IDENTIFICATION="ru_RU.UTF-8"
    LC_ALL=
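
A comparison like the one above can be scripted so that it runs against whichever shells happen to be installed. This is a rough sketch, not a definitive test: probe_len is a hypothetical helper name, the list of shells is only an assumption about what might be present, and the numbers printed depend on the shell version and the current locale.

```shell
# Sketch only: probe_len is a hypothetical helper name.
# "«»" encoded as UTF-8 is two characters and four bytes.
s=$(printf '\302\253\302\273')

# Print what ${#VAR} yields in the named shell. The string is passed
# through the environment so no non-ASCII text has to be quoted.
probe_len() {
    VAR=$s "$1" -c 'echo "${#VAR}"'
}

# Probe every shell from this (assumed) list that is actually installed.
for shell in dash bash ksh mksh yash posh zsh; do
    if command -v "$shell" >/dev/null 2>&1; then
        printf '%s: %s\n' "$shell" "$(probe_len "$shell")"
    fi
done
```

A result of 4 means the shell counted bytes; 2 means it counted characters in the current locale.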




* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  0:34 ` ZyX
@ 2015-06-07  1:23   ` Bart Schaefer
  2015-06-07  1:27     ` ZyX
  2015-06-07  2:21   ` Martijn Dekker
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Schaefer @ 2015-06-07  1:23 UTC (permalink / raw)
  To: zsh-workers

On Jun 7,  3:34am, ZyX wrote:
} Subject: Re: In POSIX mode, ${#var} measures length in bytes, not characte
}
} 07.06.2015, 03:29, "Martijn Dekker" <martijn@inlv.org>:
} > When in 'emulate sh' mode, ${#var} substitutes the length of the
} > variable in bytes, not characters. This is contrary to the standard; the
} > length in characters is supposed to be substituted.[*]
} >
} > Oddly enough, zsh is POSIX compliant here in native mode, but
} > non-compliant in POSIX mode.
} 
} Do you have a reference where "character" is defined? This behaviour
} is the same in posh and dash:

I thought this was discussed on the austin-group list but I can't find a
search term to dig it out of the archives.  (I did find the thread about
${##} as a side-effect of the attempt.)

On my Ubuntu 12.04 system, /bin/sh is a link to dash, so zsh invoked
as sh behaves like /bin/sh.  On CentOS, /bin/sh links to bash, so they
behave differently.



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  1:23   ` Bart Schaefer
@ 2015-06-07  1:27     ` ZyX
  0 siblings, 0 replies; 13+ messages in thread
From: ZyX @ 2015-06-07  1:27 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers



07.06.2015, 04:24, "Bart Schaefer" <schaefer@brasslantern.com>:
> On Jun 7,  3:34am, ZyX wrote:
> } Subject: Re: In POSIX mode, ${#var} measures length in bytes, not characte
> }
> } 07.06.2015, 03:29, "Martijn Dekker" <martijn@inlv.org>:
> } > When in 'emulate sh' mode, ${#var} substitutes the length of the
> } > variable in bytes, not characters. This is contrary to the standard; the
> } > length in characters is supposed to be substituted.[*]
> } >
> } > Oddly enough, zsh is POSIX compliant here in native mode, but
> } > non-compliant in POSIX mode.
> }
> } Do you have a reference where "character" is defined? This behaviour
> } is the same in posh and dash:
>
> I thought this was discussed on the austin-group list but I can't find a
> search term to dig it out of the archives.  (I did find the thread about
> ${##} as a side-effect of the attempt.)
>
> On my Ubuntu 12.04 system, /bin/sh is a link to dash, so zsh invoked
> as sh behaves like /bin/sh.  On CentOS, /bin/sh links to bash, so they
> behave differently.

Just checked busybox (ash): like native zsh, it emits 2, so it counts Unicode code points. ksh emits 2, mksh 4.



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  0:34 ` ZyX
  2015-06-07  1:23   ` Bart Schaefer
@ 2015-06-07  2:21   ` Martijn Dekker
  2015-06-09 17:49     ` Martijn Dekker
  1 sibling, 1 reply; 13+ messages in thread
From: Martijn Dekker @ 2015-06-07  2:21 UTC (permalink / raw)
  To: zsh-workers

ZyX wrote on 07-06-15 at 02:34:
> Do you have a reference where “character” is defined?

Yes:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02

POSIX specifically allows any character encoding, including multibyte
characters, depending on the user's locale, and on the condition that
the portable character set (basically US-ASCII) is a subset of the
locale's character set.

With UTF-8 now the de facto standard encoding, and UTF-8 being a
multibyte encoding, it has become important for shells to get this right.

> This behaviour is the same in posh and dash:

Yes, dash and pdksh/mksh/posh unfortunately have this bug, too.

But bash, ksh93, and yash correctly measure characters, not bytes. (yash
is supposed to be the most POSIX-compliant of them all.)

Thanks,

- Martijn



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  0:23 In POSIX mode, ${#var} measures length in bytes, not characters Martijn Dekker
  2015-06-07  0:34 ` ZyX
@ 2015-06-08  8:44 ` Peter Stephenson
  2015-06-09  2:19   ` Martijn Dekker
  2015-06-11 11:23   ` Daniel Shahaf
  1 sibling, 2 replies; 13+ messages in thread
From: Peter Stephenson @ 2015-06-08  8:44 UTC (permalink / raw)
  To: zsh-workers

When we started multibyte support, traditional sh implementations
supported only the portable character set, so we let zsh behave as if
all characters were 8 bits wide, as the least disruptive change.
Multibyte is on in all other emulations, even ksh.

With multibyte character sets virtually universal, it's probably time to
assume that obeying the locale causes the least breakage.

With MULTIBYTE always on when available, the "EMULATE" flag becomes
redundant.

pws

diff --git a/Src/options.c b/Src/options.c
index 3e3e074..78f603d 100644
--- a/Src/options.c
+++ b/Src/options.c
@@ -192,7 +192,7 @@ static struct optname optns[] = {
 {{NULL, "monitor",	      OPT_SPECIAL},		 MONITOR},
 {{NULL, "multibyte",
 #ifdef MULTIBYTE_SUPPORT
-			      OPT_EMULATE|OPT_ZSH|OPT_CSH|OPT_KSH
+			      OPT_ALL
 #else
 			      0
 #endif



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-08  8:44 ` Peter Stephenson
@ 2015-06-09  2:19   ` Martijn Dekker
  2015-06-09  8:35     ` Peter Stephenson
  2015-06-11 11:23   ` Daniel Shahaf
  1 sibling, 1 reply; 13+ messages in thread
From: Martijn Dekker @ 2015-06-09  2:19 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson wrote on 08-06-15 at 10:44:
> With MULTIBYTE always on when available, the "EMULATE" flag becomes
> redundant.

I should have guessed there's a shell option for that. Still good to
have it on by default.

Between that and POSIX_ARGZERO, I've found (so far) that the following
puts zsh 5.0.8 in the most fully compliant POSIX mode:

emulate sh -o POSIX_ARGZERO -o MULTIBYTE
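
To see what the MULTIBYTE option changes, plain sh emulation can be compared with sh emulation that has the option requested explicitly. This is only a sketch: it assumes a zsh recent enough to accept "emulate sh -o ...", and the numbers printed depend on the zsh version and on the locale, so no expected output is given.

```shell
# Sketch only; the output depends on the zsh version and the locale.
if command -v zsh >/dev/null 2>&1; then
    # sh emulation with whatever defaults the installed zsh uses
    zsh -c 'emulate sh; mot=$(printf "arr\303\252t\303\251"); echo "${#mot}"'
    # sh emulation with MULTIBYTE requested explicitly
    zsh -c 'emulate sh -o MULTIBYTE; mot=$(printf "arr\303\252t\303\251"); echo "${#mot}"'
fi
```

If both lines print the same number in a UTF-8 locale, the installed zsh already enables MULTIBYTE in sh emulation.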

Thanks,

- M.



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-09  2:19   ` Martijn Dekker
@ 2015-06-09  8:35     ` Peter Stephenson
  2015-06-09 15:37       ` Bart Schaefer
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Stephenson @ 2015-06-09  8:35 UTC (permalink / raw)
  To: zsh-workers

On Tue, 9 Jun 2015 04:19:22 +0200
Martijn Dekker <martijn@inlv.org> wrote:
> Between that and POSIX_ARGZERO, I've found (so far) that the following
> puts zsh 5.0.8 in the most fully compliant POSIX mode:
> 
> emulate sh -o POSIX_ARGZERO -o MULTIBYTE

I can't remember why POSIX_ARGZERO isn't on, or even if there was
actually a reason.

pws



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-09  8:35     ` Peter Stephenson
@ 2015-06-09 15:37       ` Bart Schaefer
  2015-06-09 15:43         ` Peter Stephenson
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Schaefer @ 2015-06-09 15:37 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers


On Tuesday, June 9, 2015, Peter Stephenson <p.stephenson@samsung.com> wrote:
>
>
> I can't remember why POSIX_ARGZERO isn't on, or even if there was
> actually a reason.
>
>
The doc actually does explain this, for once.

For compatibility with previous versions of the shell, emulations use
NO_FUNCTION_ARGZERO instead of POSIX_ARGZERO, which may result in
unexpected scoping of $0 if the emulation mode is changed inside a function
or script.
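
The effect on $0 can be sketched as follows. This is illustrative only: whoami_fn is a hypothetical function name, and the last command assumes a zsh that knows the POSIX_ARGZERO option (older versions will report an error instead).

```shell
# Sketch only; whoami_fn is a hypothetical function name.
if command -v zsh >/dev/null 2>&1; then
    # Native mode: FUNCTION_ARGZERO is on, so $0 inside a function
    # is the name of the function.
    zsh -c 'whoami_fn() { echo "$0"; }; whoami_fn'
    # sh emulation: NO_FUNCTION_ARGZERO is used, so $0 is not
    # replaced by the function name.
    zsh -c 'emulate sh; whoami_fn() { echo "$0"; }; whoami_fn'
    # POSIX_ARGZERO requested explicitly, as suggested in this thread.
    zsh -c 'emulate sh -o POSIX_ARGZERO; whoami_fn() { echo "$0"; }; whoami_fn'
fi
```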


* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-09 15:37       ` Bart Schaefer
@ 2015-06-09 15:43         ` Peter Stephenson
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Stephenson @ 2015-06-09 15:43 UTC (permalink / raw)
  To: zsh-workers

On Tue, 9 Jun 2015 08:37:05 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Tuesday, June 9, 2015, Peter Stephenson <p.stephenson@samsung.com> wrote:
> >
> >
> > I can't remember why POSIX_ARGZERO isn't on, or even if there was
> > actually a reason.
> >
> >
> The doc actually does explain this, for once.
> 
> For compatibility with previous versions of the shell, emulations use
> NO_FUNCTION_ARGZERO instead of POSIX_ARGZERO, which may result in
> unexpected scoping of $0 if the emulation mode is changed inside a function
> or script.

So we'd have to special-case the top-level emulation, on the assumption
that once a sh, always a sh.  Not clear if it's worth it.

pws



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-07  2:21   ` Martijn Dekker
@ 2015-06-09 17:49     ` Martijn Dekker
  0 siblings, 0 replies; 13+ messages in thread
From: Martijn Dekker @ 2015-06-09 17:49 UTC (permalink / raw)
  To: zsh-workers

Martijn Dekker wrote on 07-06-15 at 04:21:
> Yes, dash and pdksh/mksh/posh unfortunately have this bug, too.

Actually, mksh fixed it (though other pdksh variants haven't). Sorry for
the oversight.

- M.



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-08  8:44 ` Peter Stephenson
  2015-06-09  2:19   ` Martijn Dekker
@ 2015-06-11 11:23   ` Daniel Shahaf
  2015-06-11 11:33     ` Peter Stephenson
  1 sibling, 1 reply; 13+ messages in thread
From: Daniel Shahaf @ 2015-06-11 11:23 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers

Peter Stephenson wrote on Mon, Jun 08, 2015 at 09:44:20 +0100:
> diff --git a/Src/options.c b/Src/options.c
> index 3e3e074..78f603d 100644
> --- a/Src/options.c
> +++ b/Src/options.c
> @@ -192,7 +192,7 @@ static struct optname optns[] = {
>  {{NULL, "monitor",	      OPT_SPECIAL},		 MONITOR},
>  {{NULL, "multibyte",
>  #ifdef MULTIBYTE_SUPPORT
> -			      OPT_EMULATE|OPT_ZSH|OPT_CSH|OPT_KSH
> +			      OPT_ALL
>  #else
>  			      0
>  #endif

Update zshoptions.yo?



* Re: In POSIX mode, ${#var} measures length in bytes, not characters
  2015-06-11 11:23   ` Daniel Shahaf
@ 2015-06-11 11:33     ` Peter Stephenson
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Stephenson @ 2015-06-11 11:33 UTC (permalink / raw)
  To: zsh-workers

On Thu, 11 Jun 2015 11:23:37 +0000
Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> Update zshoptions.yo?

Yes, I've been meaning to.

pws

diff --git a/Doc/Zsh/options.yo b/Doc/Zsh/options.yo
index fa54024..db9b18b 100644
--- a/Doc/Zsh/options.yo
+++ b/Doc/Zsh/options.yo
@@ -634,7 +634,7 @@ pindex(NO_MULTIBYTE)
 pindex(NOMULTIBYTE)
 cindex(characters, multibyte, in expansion and globbing)
 cindex(multibyte characters, in expansion and globbing)
-item(tt(MULTIBYTE) <C> <K> <Z>)(
+item(tt(MULTIBYTE) <D>)(
 Respect multibyte characters when found in strings.
 When this option is set, strings are examined using the
 system library to determine how many bytes form a character, depending
@@ -642,10 +642,8 @@ on the current locale.  This affects the way characters are counted in
 pattern matching, parameter values and various delimiters.
 
 The option is on by default if the shell was compiled with
-tt(MULTIBYTE_SUPPORT) except in tt(sh) emulation; otherwise it is off by
-default and has no effect if turned on.  The mode is off in tt(sh)
-emulation for compatibility but for interactive use may need to be
-turned on if the terminal interprets multibyte characters.
+tt(MULTIBYTE_SUPPORT); otherwise it is off by default and has no effect
+if turned on.
 
 If the option is off a single byte is always treated as a single
 character.  This setting is designed purely for examining strings



end of thread, other threads:[~2015-06-11 11:33 UTC | newest]

Thread overview: 13+ messages
2015-06-07  0:23 In POSIX mode, ${#var} measures length in bytes, not characters Martijn Dekker
2015-06-07  0:34 ` ZyX
2015-06-07  1:23   ` Bart Schaefer
2015-06-07  1:27     ` ZyX
2015-06-07  2:21   ` Martijn Dekker
2015-06-09 17:49     ` Martijn Dekker
2015-06-08  8:44 ` Peter Stephenson
2015-06-09  2:19   ` Martijn Dekker
2015-06-09  8:35     ` Peter Stephenson
2015-06-09 15:37       ` Bart Schaefer
2015-06-09 15:43         ` Peter Stephenson
2015-06-11 11:23   ` Daniel Shahaf
2015-06-11 11:33     ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/
