From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
List-Id: Zsh Workers List
X-Seq: 30676
Date: Tue, 18 Sep 2012 09:51:45 +0100
From: Peter Stephenson
Subject: Re: PATCH: PCRE support for embedded NUL characters
Message-ID: <20120918095145.76dabc4b@pwslap01u.europe.root.pri>
In-Reply-To: <20120917190422.GA41017@redoubt.spodhuis.org>
References: <20120916125015.GA87764@redoubt.spodhuis.org>
 <20120917095727.23896b8b@pwslap01u.europe.root.pri>
 <20120917190422.GA41017@redoubt.spodhuis.org>
Organization: Cambridge Silicon Radio

On Mon, 17 Sep 2012 15:04:23 -0400
Phil Pennock wrote:
> Yeah, but correlating offsets in unmetafied strings to the metafied
> strings for then counting is non-trivial (or so it seems to me).

It's not so difficult: we already do most of this conversion for other
similar cases of pattern matching, where we need to convert offsets in
octets to characters.  The only difference is the metafication, which
just means the loop over the characters is slightly different.  If
anything, it's marginally easier, since metafication is a pure zsh
invention.
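
To illustrate the kind of loop meant here, the following is a minimal
sketch only, not code from the zsh tree.  It assumes the usual zsh
encoding in which a special byte is stored as a Meta marker followed by
the byte XORed with 32, and it assumes the marker value 0x83.  The
function name unmeta_to_meta_offset() is hypothetical.

```c
#include <stddef.h>

#define META 0x83   /* assumed value of zsh's Meta marker byte */

/*
 * Map an offset counted in unmetafied octets (as a regex library would
 * report it) to the corresponding offset in the metafied string.
 * A Meta pair occupies two bytes in the metafied string but stands for
 * a single octet of the unmetafied one.  The metafied string is
 * NUL-terminated because a real NUL is always stored as a Meta pair.
 */
size_t unmeta_to_meta_offset(const char *meta, size_t unmeta_off)
{
    const unsigned char *p = (const unsigned char *)meta;
    size_t i = 0;

    while (unmeta_off > 0 && p[i] != '\0') {
        if (p[i] == META && p[i + 1] != '\0')
            i += 2;     /* Meta marker plus the encoded byte */
        else
            i += 1;     /* ordinary byte */
        unmeta_off--;
    }
    return i;           /* clamped to the string length if off the end */
}
```

The same walk could count characters rather than bytes by calling a
multibyte conversion function at each step instead of advancing one
octet at a time.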

> And wcwidth() tells how many display cells are needed for a given
> character, assuming a monospace layout. For this, instead, mblen() is
> needed, on a character-by-character basis. Given that mblen() is C99, I
> opted to avoid it, and implement this just for UTF-8 with bit-pattern
> examination to quickly count past characters. We only initialise PCRE
> for wide characters with UTF-8. I've no idea how much effort we want to
> put into supporting non-UTF-8 wide-character PCRE across multiple OSes.

Doing it just for UTF-8 is incompatible with the rest of the shell.  It
should be possible to do it similarly to mb_metastrlen() in utils.c.
Basically the only difference is using an explicit length rather than
null termination, plus not having an internal test for Meta characters.

-- 
Peter Stephenson                        Software Engineer
Tel: +44 (0)1223 692070                 Cambridge Silicon Radio Limited
Churchill House, Cambridge Business Park, Cowley Road,
Cambridge, CB4 0WZ, UK

Member of the CSR plc group of companies. CSR plc registered in England
and Wales, registered number 4187346, registered office Churchill House,
Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
More information can be found at www.csr.com. Follow CSR on Twitter at
http://twitter.com/CSR_PLC and read our blog at www.csr.com/blog
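
The locale-independent counting suggested in the reply above (explicit
length, no Meta test) might look roughly like the following.  This is a
sketch under stated assumptions, not the actual mb_metastrlen() code:
the name mb_count_chars() is hypothetical, an embedded NUL is assumed to
occupy exactly one byte, and an invalid or incomplete sequence is
counted one byte at a time.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in the first len bytes of s using the current
 * locale's multibyte encoding.  The buffer need not be NUL-terminated
 * and may contain embedded NUL bytes.
 */
size_t mb_count_chars(const char *s, size_t len)
{
    mbstate_t mbs;
    size_t count = 0;

    memset(&mbs, 0, sizeof mbs);        /* start in the initial shift state */
    while (len > 0) {
        wchar_t wc;
        size_t ret = mbrtowc(&wc, s, len, &mbs);

        if (ret == (size_t)-1 || ret == (size_t)-2) {
            /* Invalid or incomplete: treat one byte as one character
             * and reset the conversion state before carrying on. */
            memset(&mbs, 0, sizeof mbs);
            ret = 1;
        } else if (ret == 0) {
            /* mbrtowc() returns 0 for a NUL; assume it took one byte. */
            ret = 1;
        }
        s += ret;
        len -= ret;
        count++;
    }
    return count;
}
```

Because the conversion goes through mbrtowc(), the result follows
whatever locale has been set with setlocale() rather than being tied to
UTF-8.  A caller would pass the unmetafied subject buffer and a byte
offset reported by the matcher to obtain the character offset.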