zsh-users
* Three-byte UTF-8 chars and $functions
@ 2015-05-28 22:05 Andrew Janke
  2015-05-29 11:12 ` Peter Stephenson
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Janke @ 2015-05-28 22:05 UTC (permalink / raw)
  To: Zsh Users

Hi, ZSH users and workers,

I'm seeing some odd behavior related to multibyte UTF-8 characters that 
I don't understand. Maybe one of you could help me figure out? I don't 
know if it's user error, incorrect behavior expectations on my part, or 
an actual character handling issue.

The deal: If I define a function that contains a multibyte UTF-8 
character that's three bytes long, the function itself and its 
representation from `which` work as expected. But if I grab the function 
definition to a string using `str="${functions[funcname]}"`, then that 
multibyte character seems to get garbled and come back as 3 characters, 
the bytes of which do not look like the original UTF-8 bytes. The 
${#str} operator reports a different string length than I expected.

Is there something going on with character encodings and $functions? 
This sounds almost like it's getting re-encoded in a different encoding. 
Or maybe my expectation that ${functions[func]} can be extracted 
directly to a string is wrong?

Here's a zsh script that will reproduce the problem. This is in Unicode, 
and this email message should be encoded in UTF-8, but I'm not sure it 
won't get messed up on the way to the list. This code listing contains 
exactly one non-ASCII character, the character echoed by function foo(). 
It is U+25B8. The function should look exactly like "function bar () { 
echo 'x' }", but with the character "x" replaced by U+25B8 (as a single 
character, not an escape sequence of some sort).
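
In case the character does get mangled in transit, here is a minimal sketch that regenerates it from ASCII-only source. It assumes the standard UTF-8 encoding of U+25B8, which is the three bytes E2 96 B8 (octal 342 226 270). The helper names `triangle_bytes` and `triangle_hex` are made up for illustration and are not part of the script below.

```shell
# Emit U+25B8 as raw UTF-8 bytes. Octal escapes keep this listing pure ASCII:
# hex E2 96 B8 is octal 342 226 270.
triangle_bytes() {
    printf '\342\226\270'
}

# Hex-dump those bytes as one unbroken string of hex digits.
# tr strips the padding that od adds, which differs between platforms.
triangle_hex() {
    triangle_bytes | od -An -tx1 | tr -d ' \n'
}

triangle_hex; echo          # hex digits of the encoded character
triangle_bytes | wc -c      # number of bytes in the encoding
```

Comparing this hex dump with a dump of whatever arrives in the listing shows whether the mail path altered the character.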

Thanks in advance for any assistance or info you can provide.


#!/bin/zsh
# weird_unicode_func.zsh
#
# Shows weird behavior of multibyte UTF-8 Unicode characters
# in function definitions exposed through $functions

export LC_ALL=en_US.UTF-8

uname -a
echo zsh $ZSH_VERSION

locale

# Function that echoes U+25B8 BLACK RIGHT-POINTING SMALL TRIANGLE
# (Chosen because that's single-width, at least in Menlo/Meslo)
function foo () { echo '▸' }

# And a function that echoes a normal character
function bar () { echo 'x' }


# I'd think these *should* be the same length
# And they are, for "which": 21 chars each, on OS X
# And their output should be 2 chars each
which foo
echo `which foo | wc -m` chars
echo "foo output: $(foo | wc -m) chars $(foo | wc -c) bytes"
which bar
echo `which bar | wc -m` chars
echo "bar output: $(bar | wc -m) chars $(bar | wc -c) bytes"

# But capture their string representations...
foo_str="${functions[foo]}"
bar_str="${functions[bar]}"

# ...and in those strings, I see foo as 2 chars longer than bar
# (11 chars vs 9 chars)
echo '${functions[foo]} is:'
echo "$foo_str"
echo ${#foo_str} chars
# And wc doesn't even like it; looks like it's invalid UTF-8?
echo -n "$foo_str" | wc -m
echo '${functions[bar]} is:'
echo "$bar_str"
echo ${#bar_str} chars
echo

# Plus, re-evaling it to define another function ends up
# producing weird output when that function is run
eval "function qux { ${functions[foo]} }"
which qux
echo qux output:
qux
echo $(qux | wc -c) bytes
qux | wc -m

# end of script



And here's the output I get, on Mac OS X 10.9.5, using either zsh 5.0.2 
(shipped with OS X) or 5.0.7 (installed via Homebrew). If the output 
gets garbled, know that I'm seeing the triangle as a single character in 
the output of `foo` and `which foo`, but once it comes out of 
$functions, I'm seeing "?~?", where "?" is actually the 
"unsupported/invalid/no-glyph character" placeholder.

I see almost the same behavior on Debian Linux 7 with zsh 4.3.7. zsh's 
output seems the same, but there, wc (GNU wc 8.13) is happy with the 
later outputs, seeing them as 4 bytes and 2 chars. On Windows 7/Cygwin 
with zsh 5.0.7 and GNU wc 8.23, I see the same behavior as on Debian.

I get the same behavior if I don't explicitly assign LC_ALL; I'm just 
doing that for consistency (and not sure if I should).

I'm running zsh with an empty ~/.zshrc and all the default settings (AFAIK).


eilonwy% zsh weird_unicode_func.zsh
Darwin eilonwy.local 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 
16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64
zsh 5.0.2
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
foo () {
     echo '▸'
}
21 chars
foo output:        2 chars        4 bytes
bar () {
     echo 'x'
}
21 chars
bar output:        2 chars        2 bytes
${functions[foo]} is:
     echo '�~�'
11 chars
wc: stdin: Illegal byte sequence
       11
${functions[bar]} is:
     echo 'x'
9 chars

qux () {
     echo '�~�'
}
qux output:
�~�
4 bytes
wc: stdin: Illegal byte sequence
        4
eilonwy%




Cheers,
Andrew Janke
janke@pobox.com



* Re: Three-byte UTF-8 chars and $functions
  2015-05-28 22:05 Three-byte UTF-8 chars and $functions Andrew Janke
@ 2015-05-29 11:12 ` Peter Stephenson
  2015-06-06  6:40   ` Andrew Janke
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Stephenson @ 2015-05-29 11:12 UTC (permalink / raw)
  To: Zsh Users

On Thu, 28 May 2015 18:05:35 -0400
Andrew Janke <janke@pobox.com> wrote:
> I'm seeing some odd behavior related to multibyte UTF-8 characters that 
> I don't understand. Maybe one of you could help me figure out? I don't 
> know if it's user error, incorrect behavior expectations on my part, or 
> an actual character handling issue.

It's what we call "a bug".

You may not have heard the expression before as it doesn't happen that
much round here.

Sigh.

> function foo () { echo '▸' }
> 
> # But capture their string representations...
> foo_str="${functions[foo]}"
>
> # ...and in those strings, I see foo as 2 chars longer than bar

You know when you see something funny in a function and you're too
cowardly to change it so you just stick a comment about it there and
forget about it?  This happens.

See if there are still any remaining problems.

pws

diff --git a/Src/Modules/parameter.c b/Src/Modules/parameter.c
index 55157a9..04d4485 100644
--- a/Src/Modules/parameter.c
+++ b/Src/Modules/parameter.c
@@ -410,11 +410,6 @@ getfunction(UNUSED(HashTable ht), const char *name, int dis)
 	    } else
 		h = dyncat(start, t);
 	    zsfree(t);
-	    /*
-	     * TBD: Is this unmetafy correct?  Surely as this
-	     * is a parameter value it stays metafied?
-	     */
-	    unmetafy(h, NULL);
 
 	    if (shf->redir) {
 		t = getpermtext(shf->redir, NULL, 1);
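
The patch removes the unmetafy() call, so the value stays in zsh's internal metafied form. One plausible reading of the "?~?" in the reported output, offered as an assumption and not something stated in this thread: once the text was unmetafied too early, the raw middle byte 0x96 of U+25B8 was later treated as zsh's internal Tilde token (believed to be 0x96 in these versions) and printed as a literal "~". The sketch below only simulates that substitution with standard tools and does not call into zsh. The helper name `hex_of` is made up for illustration.

```shell
# Hypothetical helper: hex-dump stdin as one unbroken string of hex digits.
hex_of() {
    od -An -tx1 | tr -d ' \n'
}

# The UTF-8 bytes of U+25B8 (E2 96 B8), written as octal escapes.
original=$(printf '\342\226\270' | hex_of)

# Simulate the assumed failure: the raw byte 0x96 (octal 226) comes out as "~".
# LC_ALL=C makes tr operate on single bytes, not multibyte characters.
garbled=$(printf '\342\226\270' | LC_ALL=C tr '\226' '~' | hex_of)

echo "original: $original"
echo "garbled:  $garbled"
```

Under that assumption the garbled form is E2 7E B8: two bytes that are invalid UTF-8 on their own, with an ASCII "~" between them. That would match both the three "characters" zsh counted and the "Illegal byte sequence" error from wc on OS X.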



* Re: Three-byte UTF-8 chars and $functions
  2015-05-29 11:12 ` Peter Stephenson
@ 2015-06-06  6:40   ` Andrew Janke
  0 siblings, 0 replies; 3+ messages in thread
From: Andrew Janke @ 2015-06-06  6:40 UTC (permalink / raw)
  To: Peter Stephenson, Zsh Users

That fixes it for me on OS X 10.9. (Sorry I didn't see this earlier; 
victim of an overzealous mail filter.)

Thanks!

Andrew



