From: Vincent Lefevre
To: zsh-workers@zsh.org
Date: Wed, 5 Oct 2016 13:48:48 +0200
Subject: zsh generates invalid UTF-8 encoding in the history
Message-ID: <20161005114848.GA1125@cventin.lip.ens-lyon.fr>
List-Id: Zsh Workers List
X-Seq: 39569

With Debian's zsh 5.2-5 + some patches, when I execute commands with
some particular Unicode characters, the UTF-8 sequences are rewritten
incorrectly in the history. For instance:

cventin:~> unicode ─
U+2500 BOX DRAWINGS LIGHT HORIZONTAL
UTF-8: e2 94 80 UTF-16BE: 2500 Decimal: &#9472; Octal: \022400
─
Category: So (Symbol, Other)
Unicode block: 2500..257F; Box Drawing
Bidi: ON (Other Neutrals)

But in the history, instead of getting e2 94 80, I get: e2 83 b4 80.
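
The two byte sequences quoted above can be checked independently of zsh. The following is a minimal Python sketch and is not part of the original message; the helper name `describe` is hypothetical. It decodes each sequence strictly and reports either the code points or the offset of the first invalid byte. The final line records an arithmetic observation only: 0xb4 XOR 0x20 equals 0x94, so the recorded pair 83 b4 might be an escaped form of the original byte 94. Whether that is what zsh does here is an assumption, not something this report establishes.

```python
def describe(raw: bytes) -> str:
    """Return a short report on whether raw is valid UTF-8."""
    try:
        text = raw.decode("utf-8")  # strict decoding: raises on invalid input
    except UnicodeDecodeError as err:
        bad = " ".join(f"{b:02x}" for b in raw[err.start:err.end])
        return f"invalid UTF-8 at offset {err.start} (bytes: {bad})"
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    return f"valid UTF-8: {points}"


expected = bytes.fromhex("e29480")    # bytes of the character typed on the command line
recorded = bytes.fromhex("e283b480")  # bytes reported as found in the history file

print(describe(expected))  # valid UTF-8: U+2500
print(describe(recorded))  # invalid UTF-8 at offset 3 (bytes: 80)

# Arithmetic observation only; its interpretation is an assumption.
print(hex(0xB4 ^ 0x20))    # 0x94
```

The first three recorded bytes decode on their own to U+20F4, so a strict decoder only fails at the trailing 80.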

Concerning "e2 83 b4 80":

cventin:~> unicode --fromcp utf-8 -x e283b4
U+20F4 - No such unicode character name in database
UTF-8: e2 83 b4 UTF-16BE: 20f4 Decimal: &#8436; Octal: \020364
⃴ (⃴)
Uppercase: 20F4
Category: Cn (Other, Not Assigned)
Unicode block: 20D0..20FF; Combining Diacritical Marks for Symbols

and the 80 on its own is not a valid UTF-8 sequence.

This breaks various tools processing the history (grep, lesspipe, etc.),
first because the expected character is no longer present, also because
of invalid UTF-8, which is not regarded as a character. For instance:

cventin:~> grep -av '^.*$' .zhistory | tail -n 1 | hd
00000000  3a 20 31 34 37 35 36 36  36 34 31 38 3a 30 3b 75  |: 1475666418:0;u|
00000010  6e 69 63 6f 64 65 20 e2  83 b4 80 0a              |nicode .....|
0000001c

-- 
Vincent Lefèvre - Web:
100% accessible validated (X)HTML - Blog:
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)