From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-39569-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 4672 invoked by alias); 5 Oct 2016 11:48:58 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 39569
Received: (qmail 28559 invoked from network); 5 Oct 2016 11:48:58 -0000
X-Qmail-Scanner-Diagnostics: from cventin.lip.ens-lyon.fr by f.primenet.com.au (envelope-from <vincent@vinc17.net>, uid 7791) with qmail-scanner-2.11 
 (clamdscan: 0.99.2/21882. spamassassin: 3.4.1.  
 Clear:RC:0(140.77.13.17):SA:0(0.0/5.0):. 
 Processed in 0.468884 secs); 05 Oct 2016 11:48:58 -0000
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=unavailable
	autolearn_force=no version=3.4.1
X-Envelope-From: vincent@vinc17.net
X-Qmail-Scanner-Mime-Attachments: |
X-Qmail-Scanner-Zip-Files: |
Received-SPF: none (ns1.primenet.com.au: domain at vinc17.net does not designate permitted sender hosts)
Date: Wed, 5 Oct 2016 13:48:48 +0200
From: Vincent Lefevre <vincent@vinc17.net>
To: zsh-workers@zsh.org
Subject: zsh generates invalid UTF-8 encoding in the history
Message-ID: <20161005114848.GA1125@cventin.lip.ens-lyon.fr>
Mail-Followup-To: zsh-workers@zsh.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
X-Mailer-Info: https://www.vinc17.net/mutt/
User-Agent: Mutt/1.7.0-6804-vl-r91193 (2016-10-01)

With Debian's zsh 5.2-5 (plus some patches), when I execute commands
containing certain Unicode characters, their UTF-8 sequences are
written incorrectly to the history file. For instance:

cventin:~> unicode ─
U+2500 BOX DRAWINGS LIGHT HORIZONTAL
UTF-8: e2 94 80 UTF-16BE: 2500 Decimal: &#9472; Octal: \022400
─
Category: So (Symbol, Other)
Unicode block: 2500..257F; Box Drawing
Bidi: ON (Other Neutrals)

But in the history file, instead of e2 94 80, I get e2 83 b4 80.
The first three bytes of "e2 83 b4 80" decode as follows:

cventin:~> unicode --fromcp utf-8 -x e283b4
U+20F4  - No such unicode character name in database
UTF-8: e2 83 b4 UTF-16BE: 20f4 Decimal: &#8436; Octal: \020364
⃴ (⃴)
Uppercase: 20F4
Category: Cn (Other, Not Assigned)
Unicode block: 20D0..20FF; Combining Diacritical Marks for Symbols

and the remaining 80 is a continuation byte, so on its own it is not
a valid UTF-8 sequence.
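
The byte-level claim above can be checked with a strict decoder. The
following is a minimal sketch in Python 3, not part of the session
shown here; the helper name check_utf8 is made up for illustration.
It decodes the expected sequence and the one found in the history,
and reports the offset at which strict UTF-8 decoding fails.

```python
def check_utf8(data: bytes):
    """Return (True, text) if data is valid UTF-8,
    else (False, offset of the first undecodable byte)."""
    try:
        return True, data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, exc.start

expected = bytes.fromhex("e29480")      # U+2500, as typed on the command line
in_history = bytes.fromhex("e283b480")  # bytes actually found in the history

print(check_utf8(expected))    # valid: decodes to the single character U+2500
print(check_utf8(in_history))  # invalid: decoding stops at offset 3 (the 0x80)
```

As a side observation, offered only as a guess: 0xb4 XOR 0x20 is
0x94, so the pair 83 b4 might be an escaped form of the original
94 byte rather than random corruption.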

This breaks various tools that process the history (grep, lesspipe,
etc.): first because the expected character is no longer present,
and second because the invalid UTF-8 byte is not regarded as a
character. For instance, in a UTF-8 locale, "." does not match the
stray byte, so the affected line can be selected with:

cventin:~> grep -av '^.*$' .zhistory | tail -n 1 | hd
00000000  3a 20 31 34 37 35 36 36  36 34 31 38 3a 30 3b 75  |: 1475666418:0;u|
00000010  6e 69 63 6f 64 65 20 e2  83 b4 80 0a              |nicode .....|
0000001c
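
The same check can be done without relying on the locale. Below is a
minimal sketch in Python 3 that lists the lines of a file that are
not valid UTF-8. The function name invalid_utf8_lines and the
two-line sample are assumptions made for illustration: the second
sample line is the entry from the hex dump above, and the first is a
made-up correct counterpart.

```python
import os
import tempfile

def invalid_utf8_lines(path):
    """Yield (line_number, raw_bytes) for each line that is not valid UTF-8."""
    with open(path, "rb") as fh:
        for number, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield number, raw

# Stand-in for a history file: one valid entry, one broken entry.
sample = (b": 1475666418:0;unicode \xe2\x94\x80\n"
          b": 1475666418:0;unicode \xe2\x83\xb4\x80\n")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
try:
    # Only the second line should be reported.
    for number, raw in invalid_utf8_lines(tmp.name):
        print(number, raw.hex())
finally:
    os.remove(tmp.name)
```

To run it on a real history file, pass its path (for example
.zhistory) to invalid_utf8_lines instead of the temporary file.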

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)