From: Andrew Janke
To: Zsh Users
Subject: Three-byte UTF-8 chars and $functions
Date: Thu, 28 May 2015 18:05:35 -0400
Message-ID: <5567912F.9040004@pobox.com>
List-Id: Zsh Users List
X-Seq: 20234

Hi, ZSH users and workers,

I'm seeing some odd behavior related to multibyte UTF-8 characters that I don't understand. Maybe one of you could help me figure out? I don't know if it's user error, incorrect behavior expectations on my part, or an actual character handling issue.

The deal: If I define a function that contains a multibyte UTF-8 character that's three bytes long, the function itself and its representation from `which` work as expected.
But if I grab the function definition to a string using `str="${functions[funcname]}"`, then that multibyte character seems to get garbled and come back as 3 characters, the bytes of which do not look like the original UTF-8 bytes. The ${#str} operator reports a different string length than I expected.

Is there something going on with character encodings and $functions? This sounds almost like it's getting re-encoded in a different encoding. Or maybe my expectation that ${functions[func]} can be extracted directly to a string is wrong?

Here's a zsh script that will reproduce the problem. This is in Unicode, and this email message should be encoded in UTF-8, but I'm not sure it won't get messed up on the way to the list. This code listing contains exactly one non-ASCII character, the character echoed by function foo(). It is U+25B8. The function should look exactly like "function bar () { echo 'x' }", but with the character "x" replaced by U+25B8 (as a single character, not an escape sequence of some sort).

Thanks in advance for any assistance or info you can provide.

#!/bin/zsh
# weird_unicode_func.zsh
#
# Shows weird behavior of multibyte UTF-8 Unicode characters
# in function definitions exposed through $functions

export LC_ALL=en_US.UTF-8
uname -a
echo zsh $ZSH_VERSION
locale

# Function that echoes U+25B8 BLACK RIGHT-POINTING SMALL TRIANGLE
# (Chosen because that's single-width, at least in Menlo/Meslo)
function foo () { echo '▸' }
# And a function that echoes a normal character
function bar () { echo 'x' }

# I'd think these *should* be the same length
# And they are, for "which": 21 chars each, on OS X
# And their output should be 2 chars each
which foo
echo `which foo | wc -m` chars
echo "foo output: $(foo | wc -m) chars $(foo | wc -c) bytes"
which bar
echo `which bar | wc -m` chars
echo "bar output: $(bar | wc -m) chars $(bar | wc -c) bytes"

# But capture their string representations...
foo_str="${functions[foo]}"
bar_str="${functions[bar]}"
# ...and in those strings, I see foo as 2 chars longer than bar
# (11 chars vs 9 chars)
echo '${functions[foo]} is:'
echo "$foo_str"
echo ${#foo_str} chars
# And wc doesn't even like it; looks like it's invalid UTF-8?
echo -n "$foo_str" | wc -m
echo '${functions[bar]} is:'
echo "$bar_str"
echo ${#bar_str} chars
echo

# Plus, re-evaling it to define another function ends up
# producing weird output when that function is run
eval "function qux { ${functions[foo]} }"
which qux
echo qux output:
qux
echo $(qux | wc -c) bytes
qux | wc -m

# end of script

And here's the output I get, on Mac OS X 10.9.5, using either zsh 5.0.2 (shipped with OS X) or 5.0.7 (installed via Homebrew). If the output gets garbled, know that I'm seeing the triangle as a single character in the output of `foo` and `which foo`, but once it comes out of $functions, I'm seeing "?~?", where "?" is actually the "unsupported/invalid/no-glyph character" placeholder.

I see almost the same behavior on Debian Linux 7 with zsh 4.3.7. zsh's output seems the same, but there, wc (GNU wc 8.13) is happy with the later outputs, seeing them as 4 bytes and 2 chars. On Windows 7/Cygwin with zsh 5.0.7 and GNU wc 8.23, I see the same behavior as on Debian.

I get the same behavior if I don't explicitly assign LC_ALL; I'm just doing that for consistency (and not sure if I should). I'm running zsh with an empty ~/.zshrc and all the default settings (AFAIK).
eilonwy% zsh weird_unicode_func.zsh
Darwin eilonwy.local 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64
zsh 5.0.2
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
foo () {
	echo '▸'
}
21 chars
foo output: 2 chars 4 bytes
bar () {
	echo 'x'
}
21 chars
bar output: 2 chars 2 bytes
${functions[foo]} is:
	echo '�~�'
11 chars
wc: stdin: Illegal byte sequence
11
${functions[bar]} is:
	echo 'x'
9 chars

qux () {
	echo '�~�'
}
qux output:
�~�
4 bytes
wc: stdin: Illegal byte sequence
4
eilonwy%

Cheers,
Andrew Janke
janke@pobox.com
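
P.S. Since the glyphs may not survive the trip through the list, a hex dump of the bytes might be a more reliable way to compare what goes into the function with what comes out of $functions. Below is a minimal sketch of that idea; it is not part of the script above and I have not included its output in the run shown. `hexbytes` is a hypothetical helper name, and it assumes a POSIX `od` is available. For reference, U+25B8 encoded as UTF-8 is the three bytes e2 96 b8.

```shell
# Hypothetical helper (not part of weird_unicode_func.zsh): print every
# byte of the first argument as two hex digits separated by single spaces.
hexbytes () {
  # od emits the bytes in hex with no address column; the unquoted
  # command substitution collapses od's padding into single spaces.
  echo $(printf '%s' "$1" | od -An -tx1)
}

# Sanity check with the expected encoding of U+25B8, written as octal
# escapes so the listing itself stays pure ASCII.
hexbytes "$(printf '\342\226\270')"    # prints: e2 96 b8

# In zsh, after defining foo as in the script, the idea would be to
# compare these two dumps (zsh-only, so left as comments here):
#   hexbytes "$(foo)"
#   hexbytes "${functions[foo]}"
```

If the second dump shows something other than e2 96 b8 inside the quotes, that would pin down exactly which byte is being changed, independent of how any terminal or mail client renders it.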