From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 18406 invoked by alias); 17 Sep 2012 08:57:55 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List
List-Post:
List-Help:
X-Seq: 30672
Received: (qmail 12581 invoked from network); 17 Sep 2012 08:57:42 -0000
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on f.primenet.com.au
X-Spam-Level:
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_LOW,
	SPF_HELO_PASS autolearn=ham version=3.3.2
Received-SPF: none (ns1.primenet.com.au: domain at csr.com does not designate permitted sender hosts)
Date: Mon, 17 Sep 2012 09:57:27 +0100
From: Peter Stephenson
To:
Subject: Re: PATCH: PCRE support for embedded NUL characters
Message-ID: <20120917095727.23896b8b@pwslap01u.europe.root.pri>
In-Reply-To: <20120916125015.GA87764@redoubt.spodhuis.org>
References: <20120916125015.GA87764@redoubt.spodhuis.org>
Organization: Cambridge Silicon Radio
X-Mailer: Claws Mail 3.7.9 (GTK+ 2.22.0; i386-redhat-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.101.10.18]
X-Scanned-By: MailControl 9446.0 (www.mailcontrol.com) on 10.68.0.115

On Sun, 16 Sep 2012 08:50:15 -0400
Phil Pennock wrote:
> This patch does not touch the docs or code for non-PCRE.  It just
> changes PCRE: code, docs, tests.  As I wrote this mail, I realised a
> couple of issues which lead me to think I shouldn't just commit this as
> is; there are open questions below for Peter/Bart.
>
> Moritz Bunkus reported a problem which boiled down to regular
> expressions containing ASCII NUL characters; the support for UTF-8
> multibyte characters in regular expressions meant that we no longer
> passed zsh's internal metafied forms to the regular expression
> libraries, but this also meant that NUL characters in zsh now make it
> down.
> There's no way I can see to deal with this for zsh/regex, but for
> zsh/pcre we can hack around it.

I'm not sure what the right answer is for zsh/regex, but documenting it
will do for now.

> More careful tracking of length, instead of using strlen(), lets us pass
> the search space (haystack) down intact.  For the regular expression, as
> part of the unmetafy() I now check to see if strlen() doesn't match the
> decoded length, in which case there's a NUL somewhere, and then all the
> NULs get replaced with \x00 in the pattern.
>
> One thing that occurs to me now: what's the correct expectation if the
> pattern contains two characters, "backslash NUL"?  Before, that broke;
> with this, it becomes \\x00 which won't match.  \CHAR for a non-letter
> CHAR should remove special handling and treat the character as itself.
> Fixing this seems to require full string parsing in zsh with knowledge
> of the regexp escape sequences.  Document it as a limitation?

You can just do a pre-scan of the whole string for backslashes.  If
there's a backslash followed by a non-NUL, skip checking that next
character (which may itself be a backslash that's escaped); if there's
a backslash followed by a NUL, the backslash can go.  It's such an
unusual case it hardly seems worth it, though.

> Another open question: are $mbegin/$mend offsets supposed to be in
> octets or in characters?  Given the MB_ prefix, I'm guessing I just
> broke this and will need to fix it tomorrow, before commit, after I get
> some sleep.  Do we have a decent way to count the number of wide
> characters in an unmetafied string which can contain NUL characters?

They are characters.  If the string is unmetafied you can skip the
MB_METACHARLEN() stuff and use the mbrtowc()/WCWIDTH() library calls
directly (WCWIDTH() is only defined in order to be able to replace an
unusable wcwidth()), but a NUL probably needs to be a special case
since I think the libraries assume it's a terminator.
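The two suggestions above (the backslash pre-scan and the character
count with NUL as a special case) could be sketched roughly as follows.
This is only an illustration in standard C, not the patch itself: the
function names escape_nuls_in_pattern() and count_chars_with_nuls() are
made up for this example and are not zsh functions, the output buffer
is assumed to hold at least 4 * len + 1 bytes, and counting an invalid
or truncated multibyte sequence as one character per byte is an
assumption about what would be wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: copy the unmetafied pattern `pat` (len bytes,
 * may contain NULs) into `out`, writing every NUL as the four bytes
 * \x00.  A backslash directly before a NUL is dropped; a backslash
 * before any other byte is copied together with that byte, so an
 * escaped backslash is never mistaken for the start of a new escape.
 * `out` must have room for 4 * len + 1 bytes.  Returns the number of
 * bytes written, not counting the terminating NUL.
 */
size_t escape_nuls_in_pattern(const char *pat, size_t len, char *out)
{
    size_t i = 0, o = 0;

    assert(pat != NULL && out != NULL);
    while (i < len) {
        if (pat[i] == '\\' && i + 1 < len && pat[i + 1] != '\0') {
            /* backslash + non-NUL: copy both, skip checking the second */
            out[o++] = pat[i++];
            out[o++] = pat[i++];
        } else if (pat[i] == '\\' && i + 1 < len) {
            /* backslash + NUL: the backslash can go; NUL handled next */
            i++;
        } else if (pat[i] == '\0') {
            memcpy(out + o, "\\x00", 4);
            o += 4;
            i++;
        } else {
            out[o++] = pat[i++];
        }
    }
    out[o] = '\0';
    return o;
}

/*
 * Hypothetical helper: count characters (not octets) in an unmetafied
 * buffer of `len` bytes using mbrtowc() in the current locale.  A NUL
 * byte is counted as one character explicitly, so the library never
 * gets the chance to treat it as a terminator.
 */
size_t count_chars_with_nuls(const char *s, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    while (len > 0) {
        size_t n = 1;

        if (*s != '\0') {
            wchar_t wc;

            n = mbrtowc(&wc, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
                /* invalid or truncated sequence: count a single byte */
                n = 1;
                memset(&st, 0, sizeof st);
            }
        }
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

The count depends on the LC_CTYPE locale in effect, so a caller would
need setlocale() to have been run already for multibyte input to be
decoded as such.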
It looks like the existing pattern code uses metafied strings.

-- 
Peter Stephenson
Software Engineer
Cambridge Silicon Radio Limited