Date: Tue, 8 Mar 2011 01:52:16 -0500
From: Phil Pennock
To: zsh-workers@zsh.org
Subject: UTF-8 and PCRE and metafy
Message-ID: <20110308065216.GB79682@redoubt.spodhuis.org>

4.3.11 with rematch_pcre:

    % [[ 'foo→bar' =~ ^f.* ]]
    zsh: pcre_exec() error: -10

Same with -pcre-match

    % locale charmap
    UTF-8

Error -10 is PCRE_ERROR_BADUTF8.

In the pcre.c module, we explicitly enable PCRE_UTF8 if UTF8 is in
effect and supported.

By the:

    zwarn("pcre_exec() error: %d", r);

I shoved in a couple more zwarn()s to confirm that the string is in
non-meta form:

    zwarn("pcre_exec() error: %d", r);
    zwarn("lhstr: %s", lhstr);
    zwarn("rhre: /%s/", rhre);

→

    zsh: pcre_exec() error: -10
    zsh: lhstr: foo→bar
    zsh: rhre: /^f.*/

pcretest(1):

    % pcretest
    PCRE version 8.12 2011-01-15

      re> /^f.*/
    data> foo→bar
     0: foo\xe2\x86\x92bar

Okay, so as long as the char is making it through intact as UTF-8 then
PCRE should be handling it.

Debug each char in lhstr as an int, find it's *not* in non-meta form --
why does it print just fine, then?  :(

    % [[ 'foo→bar' =~ ^f.* ]]
    zsh: pcre_exec() error: -10
    zsh: lhstr: foo→bar
    zsh: lhstr/%l: foo→bar
    zsh: rhre: /^f.*/
    zsh: utf-8 enabled? 1
    zsh: lhstr char* item: 102
    zsh: lhstr char* item: 111
    zsh: lhstr char* item: 111
    zsh: lhstr char* item: -30
    zsh: lhstr char* item: -125
    zsh: lhstr char* item: -90
    zsh: lhstr char* item: -125
    zsh: lhstr char* item: -78
    zsh: lhstr char* item: 98
    zsh: lhstr char* item: 97
    zsh: lhstr char* item: 114

So after line 336 of pcre.c I add:

    unmetafy(lhstr, NULL);

Test:

    % unset preexec_functions ; unfunction precmd
    % [[ 'foo→bar' =~ ^f.* ]] ; print -l $? $MATCH foo $match
    pattern.c:1403: BUG: - missing from numeric glob
    0
    foo?^