caml-list - the Caml user's mailing list
From: Mauricio Fernandez <mfp@acm.org>
To: caml-list@inria.fr
Subject: Re: [Caml-list] ocaml-pcre and UTF-8
Date: Thu, 16 Feb 2012 11:18:13 +0100
Message-ID: <20120216101813.GA11489@NANA.localdomain>
In-Reply-To: <32FE3555-556A-43AC-8B1B-9A4AF08DBA02@strauss-acoustics.ch>

On Thu, Feb 16, 2012 at 10:29:30AM +0100, Philippe Strauss wrote:
> Hello caml'ers,
> 
> How do I convince PCRE to be UTF-8 friendly? example:
> 
> --
> open Pcre
> 
> external show : 'a -> string = "%show"

As an aside: where did you get this external from? On OCaml 3.12.0 I had to
write a proper show function in order to compile your example.

> let recomp = regexp ~flags:[`UTF8; `CASELESS]
> 
> let res_w = "(*UTF8)^(\w+)$"
                       =====
In an OCaml string literal it would have to be \\w if anything, but even then
the pcre manual warns that

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but, by default, the characters
       that PCRE recognizes as digits, spaces, or word characters remain the
       same set as before, all with values less than 256. This remains true
       even when PCRE is built to include Unicode property support, because
       to do otherwise would slow down PCRE in many common cases. Note in
       particular that this applies to \b and \B, because they are defined
       in terms of \w and \W. If you really want to test for a wider sense
       of, say, "digit", you can use explicit Unicode property tests such as
       \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that
       the character escapes work is changed so that Unicode properties are
       used to determine which characters match. There are more details in
       the section on generic character types in the pcrepattern
       documentation.

and pcrepattern lists this Unicode property:

  Xwd   Any Perl "word" character
  
so, given a suitable show function, both 

  let res_w = "^(\\p{Xwd}+)$"
and
  let res_w = "(*UCP)^(\\w+)$"

yield

./pcre_utf 
config_utf8=true
[|Some blurb|]
[|Some toxicité|]
[|Some velléités|]
[|Some à|]
[|Some où|]
[|Some über|]
Not_found was raised on "marie-jeanne" :-(

('-' not being a "word character" in my locale)
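
For reference, a minimal standard-library-only sketch of what the doubled
backslash in the two patterns above amounts to. It does not call PCRE at all;
it only shows the text that an OCaml string literal hands to the regexp
compiler. The names res_xwd and res_ucp are illustrative and not part of any
library.

```ocaml
(* The two pattern strings from this message, written as OCaml literals.
   A doubled backslash in the source becomes one backslash in the value. *)
let res_xwd = "^(\\p{Xwd}+)$"
let res_ucp = "(*UCP)^(\\w+)$"

(* The escape on its own: two characters, a backslash and then w. *)
let word_escape = "\\w"

let () =
  assert (String.length word_escape = 2);
  assert (word_escape.[0] = '\\');
  assert (word_escape.[1] = 'w');
  (* Print the patterns exactly as the regexp compiler would receive them. *)
  print_endline res_xwd;
  print_endline res_ucp
```

Running it prints ^(\p{Xwd}+)$ and then (*UCP)^(\w+)$, i.e. each pattern
reaches PCRE with a single backslash before p or w.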

-- 
Mauricio Fernandez  -   http://eigenclass.org
