caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* features of PCRE-OCaml
@ 2000-12-06  0:51 Markus Mottl
  2000-12-07 16:01 ` John Max Skaller
  2000-12-07 20:17 ` features of PCRE-OCaml Miles Egan
  0 siblings, 2 replies; 18+ messages in thread
From: Markus Mottl @ 2000-12-06  0:51 UTC (permalink / raw)
  To: OCAML

Hello,

it seems that many people hadn't yet learnt about PCRE-OCaml (the
OCaml-interface to the PCRE-library) and have asked for more information
on the advantages as compared to the Str-library (or to Perl).

Here is a list of features as taken from the README:

  * The PCRE-library by Philip Hazel has been under development for
    quite some time now and is fairly advanced and stable. It implements
    just about all of the convenient functionality of regular expressions
    as one can find them in PERL. The higher-level functions written
    in OCaml (split, replace), too, are compatible to the corresponding
    PERL-functions (to the extent that OCaml allows). Most people find
    the syntax of PERL-style regular expressions more straightforward
    than the Emacs-style one used in the "Str"-module.

  * In contrast to PERL, the library creates DFAs (deterministic finite
    automata) instead of NFAs (nondeterministic finite automata). DFAs
    generally allow much faster pattern matching, because they never
    need to backtrack. Especially patterns with many alternations can
    see a great speedup.

  * It is reentrant - and thus thread safe. This is not the case with
    the "Str"-module of OCaml, which builds on the GNU "regex"-library.
    Using reentrant libraries also means more convenience for
    programmers. They do not have to reason about states in which the
    library might be in.

  * The high-level functions for replacement and substitution, they are
    all implemented in OCaml, are much faster than the ones of the
    "Str"-module. In fact, when compiled to native code, they even seem
    to be significantly faster than those of PERL (PERL is written in C).

    Somebody reported to me that he had tested OCaml with PCRE-OCaml
    against PERL and Python with several 100MB data that had to be
    matched/manipulated. Trusting his claims, the overall speed of the
    OCaml-version (native code) was 15 times faster than Perl and 45
    times faster than Python, which is probably also due to the high
    quality of the OCaml-compiler.

  * You can rely on the data returned being unique. In other terms:
    if the result of a function is a string, you can safely use
    destructive updates on it without having to fear side effects.

  * The interface to the library makes use of labels and default
    arguments to give you a high degree of programming comfort.

I hope this answers most questions!

Best regards,
Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-06  0:51 features of PCRE-OCaml Markus Mottl
@ 2000-12-07 16:01 ` John Max Skaller
  2000-12-07 16:32   ` Markus Mottl
  2000-12-14 17:35   ` unicode support Nickolay Semyonov
  2000-12-07 20:17 ` features of PCRE-OCaml Miles Egan
  1 sibling, 2 replies; 18+ messages in thread
From: John Max Skaller @ 2000-12-07 16:01 UTC (permalink / raw)
  To: Markus Mottl; +Cc: OCAML

Markus Mottl wrote:

>     Somebody reported to me that he had tested OCaml with PCRE-OCaml
>     against PERL and Python with several 100MB data that had to be
>     matched/manipulated. Trusting his claims, the overall speed of the
>     OCaml-version (native code) was 15 times faster than Perl and 45
>     times faster than Python, which is probably also due to the high
>     quality of the OCaml-compiler.

Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
Given the fact this library builds DFA's instead of NFA's, Python
ought to be faster than Perl. :-)

Note also, Python 2.0 uses a modified library which does something
PCRE-OCaml cannot: it works with Unicode characters (supposedly).

I do have a question though: is it possible to build a Str compatible
interface to PCRE-Ocaml, and then make them both a standard part
of the distribution? [The idea of a functional language using
non-reentrant libraries seems slightly absurd to me :-]

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-07 16:01 ` John Max Skaller
@ 2000-12-07 16:32   ` Markus Mottl
  2000-12-07 17:08     ` John Max Skaller
  2000-12-14 17:35   ` unicode support Nickolay Semyonov
  1 sibling, 1 reply; 18+ messages in thread
From: Markus Mottl @ 2000-12-07 16:32 UTC (permalink / raw)
  To: John Max Skaller; +Cc: OCAML

On Fri, 08 Dec 2000, John Max Skaller wrote:
> Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> Given the fact this library builds DFA's instead of NFA's, Python
> ought to be faster than Perl. :-)

Well, the matching engine is not everything... ;)

> Note also, Python 2.0 uses a modified library which does something
> PCRE-OCaml cannot: it works with Unicode characters (supposedly).

To my knowledge, Phil Hazel is working on support for this. Unless the
PCRE-library supports Unicode (and unless OCaml does ;), there is not
much one can do about it...

> I do have a question though: is it possible to build a Str compatible
> interface to PCRE-Ocaml, and then make them both a standard part
> of the distribution? [The idea of a functional language using
> non-reentrant libraries seems slightly absurd to me :-]

I am not sure whether it is really necessary to have a Str compatible
interface: the regular expressions are already different so exchanging
the old against the new library would break code anyway.

- Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-07 16:32   ` Markus Mottl
@ 2000-12-07 17:08     ` John Max Skaller
  2000-12-08  0:03       ` Markus Mottl
  2000-12-08  9:19       ` Alain Frisch
  0 siblings, 2 replies; 18+ messages in thread
From: John Max Skaller @ 2000-12-07 17:08 UTC (permalink / raw)
  To: Markus Mottl; +Cc: OCAML

Markus Mottl wrote:
> 
> On Fri, 08 Dec 2000, John Max Skaller wrote:
> > Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> > Given the fact this library builds DFA's instead of NFA's, Python
> > ought to be faster than Perl. :-)
> 
> Well, the matching engine is not everything... ;)

	It is for code doing extensive matching of long strings
against a single pattern: everything else should be dwarfed
by the match time.

> > Note also, Python 2.0 uses a modified library which does something
> > PCRE-OCaml cannot: it works with Unicode characters (supposedly).
> 
> To my knowledge, Phil Hazel is working on support for this. Unless the
> PCRE-library supports Unicode (and unless OCaml does ;), there is not
> much one can do about it...

	What? You mean it isn't generic enough to just change
'char' to 'short' and recompile?  [:-)]
 
> I am not sure whether it is really necessary to have a Str compatible
> interface: the regular expressions are already different so exchanging
> the old against the new library would break code anyway.

	If the expressions were translated?

	BTW: I think some of the features of the regex are
parochial, and should be eliminated: support for case insensitive
matching, and matching 'words' etc should be dropped. Such things
might make sense in English, but are much too hard to build in
to a regexp facility correctly for internationalised text.

	By the way, how big can the DFA tables get?
Does it eliminate duplicate columns? 

	[Ocaml lex cannot support large enough tables for matching
ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
since all my languages specify UTF-8 encoded ISO-10646: I have to 
cheat, and assume 'almost everything' is a suitable character to
put in an identifier, and then check it afterwards. This makes it
hard to use use special symbols as tokens. I'm not sure why
this is, but I guess it doesn't eliminate duplicate columns?]

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-06  0:51 features of PCRE-OCaml Markus Mottl
  2000-12-07 16:01 ` John Max Skaller
@ 2000-12-07 20:17 ` Miles Egan
  2000-12-08 12:30   ` Gerd Stolpmann
  1 sibling, 1 reply; 18+ messages in thread
From: Miles Egan @ 2000-12-07 20:17 UTC (permalink / raw)
  To: Markus Mottl; +Cc: OCAML

On Wed, Dec 06, 2000 at 01:51:39AM +0100, Markus Mottl wrote:
> Hello,
> 
> it seems that many people hadn't yet learnt about PCRE-OCaml (the
> OCaml-interface to the PCRE-library) and have asked for more information
> on the advantages as compared to the Str-library (or to Perl).

It would be wonderful if this became part of the standard distribution.  This is
a very handy library but I sometimes avoid it because I don't want to deal with
installing it everywhere I want to run my app.

-- 
miles



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-07 17:08     ` John Max Skaller
@ 2000-12-08  0:03       ` Markus Mottl
  2000-12-08 17:52         ` John Max Skaller
  2000-12-08  9:19       ` Alain Frisch
  1 sibling, 1 reply; 18+ messages in thread
From: Markus Mottl @ 2000-12-08  0:03 UTC (permalink / raw)
  To: John Max Skaller; +Cc: OCAML

On Fri, 08 Dec 2000, John Max Skaller wrote:
> By the way, how big can the DFA tables get?

You can get the size of regular expressions in bytes by calling
"Pcre.size" on the regexp. They usually need very little space.

> Does it eliminate duplicate columns? 

I have no idea how Phil's code works in detail: studying > 5000 LOCs of
rather low-level C did not seem to be a promising idea to me - the code
is fast enough, anyway ;)

- Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-07 17:08     ` John Max Skaller
  2000-12-08  0:03       ` Markus Mottl
@ 2000-12-08  9:19       ` Alain Frisch
  2000-12-08 18:11         ` John Max Skaller
  1 sibling, 1 reply; 18+ messages in thread
From: Alain Frisch @ 2000-12-08  9:19 UTC (permalink / raw)
  To: John Max Skaller; +Cc: OCAML

On Fri, 8 Dec 2000, John Max Skaller wrote:

> 	[Ocaml lex cannot support large enough tables for matching
> ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
> since all my languages specify UTF-8 encoded ISO-10646: I have to 
> cheat, and assume 'almost everything' is a suitable character to
> put in an identifier, and then check it afterwards. This makes it
> hard to use use special symbols as tokens. I'm not sure why
> this is, but I guess it doesn't eliminate duplicate columns?]

Have a look at wlex:
http://www.eleves.ens.fr:8080/home/frisch/soft
http://www.eleves.ens.fr:8080/home/frisch/info/wlex-20001006.tar.gz


<< This package consists of a lexer generator and the associated runtime
system. The new lexing model adds a "classification" layer between the
lexbuf and the lexer itself. This layer classifies characters from the
lexbuf into a few number of classes, on which the regexps in the lexer
specification are built. 

 This reduces the number of states and transitions in the automaton,
especially when working with large encodings such as UTF-8 (the primary
motivation for wlex).  >>

The development release of pxp may use wlex (same lexer for different
encodings: UTF-8, Latin-1).

wlex is distributed as a patch to ocamllex.


-- 
  Alain Frisch



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-07 20:17 ` features of PCRE-OCaml Miles Egan
@ 2000-12-08 12:30   ` Gerd Stolpmann
  2000-12-08 15:05     ` Markus Mottl
  0 siblings, 1 reply; 18+ messages in thread
From: Gerd Stolpmann @ 2000-12-08 12:30 UTC (permalink / raw)
  To: Miles Egan, Markus Mottl; +Cc: OCAML

On Thu, 07 Dec 2000, Miles Egan wrote:
>On Wed, Dec 06, 2000 at 01:51:39AM +0100, Markus Mottl wrote:
>> Hello,
>> 
>> it seems that many people hadn't yet learnt about PCRE-OCaml (the
>> OCaml-interface to the PCRE-library) and have asked for more information
>> on the advantages as compared to the Str-library (or to Perl).
>
>It would be wonderful if this became part of the standard distribution.  This is
>a very handy library but I sometimes avoid it because I don't want to deal with
>installing it everywhere I want to run my app.

I would appreciate this, too. PCRE regexps have often a simpler notation
because they need fewer backslashes. Furthermore, the reentrant interface of
PCRE has advantages in multi-threaded programs (with Str, you have to throw
with mutexes after the regexps...)

However, the PCRE stubs can be improved in one point: The master lock could be
released while the engine executes the regexp. Several threads could then
use the engine at the same time which would improve the responsiveness of
multi-threaded programs, and the programs would run faster on SMP systems.
(Currently, the so-called "master lock" prevents that more than one thread runs
at the same time (to avoid problems with uninitialized memory). However, when
O'Caml calls a C function which is thread-safe itself, one can release this
lock for the time of the call, because C can cope with uninitialized memory.)
As regexp matching takes some time, there could be an interesting speedup for
programs that massively apply regexps.

Gerd
-- 
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08 12:30   ` Gerd Stolpmann
@ 2000-12-08 15:05     ` Markus Mottl
  2000-12-08 15:40       ` Gerd Stolpmann
  0 siblings, 1 reply; 18+ messages in thread
From: Markus Mottl @ 2000-12-08 15:05 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Miles Egan, OCAML

Gerd Stolpmann schrieb am Friday, den 08. December 2000:
> However, the PCRE stubs can be improved in one point: The master lock could be
> released while the engine executes the regexp. Several threads could then
> use the engine at the same time which would improve the responsiveness of
> multi-threaded programs, and the programs would run faster on SMP systems.

Are there annotations which one could use to indicate reentrancy of
external functions? This would make it fairly easy to get rid of this
problem. This information could then also be used for program analysis.

- Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08 15:05     ` Markus Mottl
@ 2000-12-08 15:40       ` Gerd Stolpmann
  2000-12-09  3:03         ` Markus Mottl
  0 siblings, 1 reply; 18+ messages in thread
From: Gerd Stolpmann @ 2000-12-08 15:40 UTC (permalink / raw)
  To: Markus Mottl; +Cc: Miles Egan, OCAML

On Fri, 08 Dec 2000, Markus Mottl wrote:
>Gerd Stolpmann schrieb am Friday, den 08. December 2000:
>> However, the PCRE stubs can be improved in one point: The master lock could be
>> released while the engine executes the regexp. Several threads could then
>> use the engine at the same time which would improve the responsiveness of
>> multi-threaded programs, and the programs would run faster on SMP systems.
>
>Are there annotations which one could use to indicate reentrancy of
>external functions? This would make it fairly easy to get rid of this
>problem. This information could then also be used for program analysis.

There are two functions making it easy: enter_blocking_section and
leave_blocking_section. For example, the stub for the read syscall of the Unix
library:

value unix_read(value fd, value buf, value ofs, value len) /* ML */
{
  long numbytes;
  int ret;
  char iobuf[UNIX_BUFFER_SIZE];

  Begin_root (buf);
    numbytes = Long_val(len);
    if (numbytes > UNIX_BUFFER_SIZE) numbytes = UNIX_BUFFER_SIZE;
    enter_blocking_section();
    ret = read(Int_val(fd), iobuf, (int) numbytes);
    leave_blocking_section();
    if (ret == -1) uerror("read", Nothing);
    bcopy(iobuf, &Byte(buf, Long_val(ofs)), ret);
  End_roots();
  return Val_int(ret);
}

Gerd
-- 
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08  0:03       ` Markus Mottl
@ 2000-12-08 17:52         ` John Max Skaller
  0 siblings, 0 replies; 18+ messages in thread
From: John Max Skaller @ 2000-12-08 17:52 UTC (permalink / raw)
  To: Markus Mottl; +Cc: OCAML

Markus Mottl wrote:

> > Does it eliminate duplicate columns?
> 
> I have no idea how Phil's code works in detail: studying > 5000 LOCs of
> rather low-level C did not seem to be a promising idea to me - the code
> is fast enough, anyway ;)

Eliminating duplicate columns slows both compilation and matching
time down. But it makes the DFA smaller.

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08  9:19       ` Alain Frisch
@ 2000-12-08 18:11         ` John Max Skaller
  2000-12-08 19:48           ` Alain Frisch
  0 siblings, 1 reply; 18+ messages in thread
From: John Max Skaller @ 2000-12-08 18:11 UTC (permalink / raw)
  To: Alain Frisch; +Cc: OCAML

Alain Frisch wrote:
 
> Have a look at wlex:
> http://www.eleves.ens.fr:8080/home/frisch/soft
> http://www.eleves.ens.fr:8080/home/frisch/info/wlex-20001006.tar.gz

	Thanks ...
> 
> << This package consists of a lexer generator and the associated runtime
> system. The new lexing model adds a "classification" layer between the
> lexbuf and the lexer itself. This layer classifies characters from the
> lexbuf into a few number of classes, on which the regexps in the lexer
> specification are built.
> 
>  This reduces the number of states and transitions in the automaton,

	No, it has no effect on the number of states (rows of the
DFA matrix). It reduces the number of columns.

> wlex is distributed as a patch to ocamllex.

	That's always a worry. Am I confused: does it change the
Ocaml source tree in place, or just copy bits of it and patch them?
Why is Findlib required? (Thats a patch too, isn't it?)

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08 18:11         ` John Max Skaller
@ 2000-12-08 19:48           ` Alain Frisch
  2000-12-09 17:07             ` John Max Skaller
  0 siblings, 1 reply; 18+ messages in thread
From: Alain Frisch @ 2000-12-08 19:48 UTC (permalink / raw)
  To: John Max Skaller; +Cc: OCAML

On Sat, 9 Dec 2000, John Max Skaller wrote:

> > Have a look at wlex:
> >  This reduces the number of states and transitions in the automaton,
> 
> 	No, it has no effect on the number of states (rows of the
> DFA matrix). It reduces the number of columns.

Yes it does: the extra classification layer may consume more than one
character in the lexbuf. When you parse UTF-8 with ocamllex, you have
"waiting" states in the automaton corresponding to multibyte encoded code
points.

> > wlex is distributed as a patch to ocamllex.
> 
> 	That's always a worry. Am I confused: does it change the
> Ocaml source tree in place, or just copy bits of it and patch them?
> Why is Findlib required? (Thats a patch too, isn't it?)

It would have been much easier for me not to distribute wlex as a patch,
but this is incompatible with the license.

The Makefile doesn't touch the source tree in place; it just makes a copy
of the lex/ directory and works on it. Findlib, which is not a patch, is
required only to install the runtime support library of wlex; you can
of course use this library directly.


-- 
  Alain Frisch



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08 15:40       ` Gerd Stolpmann
@ 2000-12-09  3:03         ` Markus Mottl
  2000-12-09 13:12           ` Gerd Stolpmann
  0 siblings, 1 reply; 18+ messages in thread
From: Markus Mottl @ 2000-12-09  3:03 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Miles Egan, OCAML

Gerd Stolpmann schrieb am Friday, den 08. December 2000:
> There are two functions making it easy: enter_blocking_section and
> leave_blocking_section. For example, the stub for the read syscall of the Unix
> library:

Ok, I have found an article by Xavier on these functions:

  http://caml.inria.fr/archives/199905/msg00035.html

So if I am not mistaken, a function that calls the GC or allocates memory
on the OCaml-heap cannot be considered reentrant even if its semantics
is otherwise referentially transparent. This means that just "tagging"
a function as "pure" is no guarantee that it won't mess up the runtime
when e.g. calling the GC concurrently - right?

In other terms I can put those functions around the largest section of
C-code that doesn't interfere with the OCaml-runtime system - then I
should be safe.

The only question now is: would it really pay for pattern matching in the
PCRE? I have taken a look at the implementation of these functions and on
their use, but have only found cases where some function really blocks for
either an indefinite (e.g. read) or at least potentially very long amount
of time (e.g. gethostbyaddr, which might need to contact a nameserver).

Without threads we won't benefit, anyway, and if we use threads, there
is a small overhead associated with calling these functions. Pattern
matching maybe does not eat up so much time in the average case that this
is justified. Any experiences or suggestions when using these functions
is advisable?

- Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-09  3:03         ` Markus Mottl
@ 2000-12-09 13:12           ` Gerd Stolpmann
  2000-12-10  0:32             ` Markus Mottl
  0 siblings, 1 reply; 18+ messages in thread
From: Gerd Stolpmann @ 2000-12-09 13:12 UTC (permalink / raw)
  To: Markus Mottl; +Cc: Miles Egan, OCAML

On Sat, 09 Dec 2000, Markus Mottl wrote:
>Gerd Stolpmann schrieb am Friday, den 08. December 2000:
>> There are two functions making it easy: enter_blocking_section and
>> leave_blocking_section. For example, the stub for the read syscall of the Unix
>> library:
>
>Ok, I have found an article by Xavier on these functions:
>
>  http://caml.inria.fr/archives/199905/msg00035.html
>
>So if I am not mistaken, a function that calls the GC or allocates memory
>on the OCaml-heap cannot be considered reentrant even if its semantics
>is otherwise referentially transparent. This means that just "tagging"
>a function as "pure" is no guarantee that it won't mess up the runtime
>when e.g. calling the GC concurrently - right?

For example, the situation must not occur where one thread is initializing
memory and is interrupted by another thread allocating memory and calling the
GC. One precondition of the GC is that memory is always initialized.

"Reentrancy" is an abstract view on the function interface; it is not true for
lower coding levels because (heap) memory is nothing but a large global variable
implicitly shared by every piece of code.

>In other terms I can put those functions around the largest section of
>C-code that doesn't interfere with the OCaml-runtime system - then I
>should be safe.

I think so.

>The only question now is: would it really pay for pattern matching in the
>PCRE? I have taken a look at the implementation of these functions and on
>their use, but have only found cases where some function really blocks for
>either an indefinite (e.g. read) or at least potentially very long amount
>of time (e.g. gethostbyaddr, which might need to contact a nameserver).
>
>Without threads we won't benefit, anyway, and if we use threads, there
>is a small overhead associated with calling these functions. Pattern
>matching maybe does not eat up so much time in the average case that this
>is justified. Any experiences or suggestions when using these functions
>is advisable?

I would say it depends on the problem size. For example, when searching in a
long text it is definitely worth-while to release the masterlock.

The more interesting case is the average text processing program with many
invocations of the PCRE engine with average problem sizes. The question is
whether the sum of all invocations is big enough such that an effect is
measurable. Ideally, I can imagine a two processor system in which one CPU only
executes Caml code, and the other only regexps. From the Caml CPU's point of
view, the PCRE calls are (ideally) cost-free (because they are delegated to the
other CPU). However, there is a synchronization overhead, and nothing is won if
the Caml CPU must spend more time with synchronization than it would spend with
executing the regexp itself.

I think it is worth an experiment.

Gerd
-- 
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-08 19:48           ` Alain Frisch
@ 2000-12-09 17:07             ` John Max Skaller
  0 siblings, 0 replies; 18+ messages in thread
From: John Max Skaller @ 2000-12-09 17:07 UTC (permalink / raw)
  To: Alain Frisch; +Cc: OCAML

Alain Frisch wrote:
> 
> On Sat, 9 Dec 2000, John Max Skaller wrote:
> 
> > > Have a look at wlex:
> > >  This reduces the number of states and transitions in the automaton,
> >
> >       No, it has no effect on the number of states (rows of the
> > DFA matrix). It reduces the number of columns.
> 
> Yes it does: the extra classification layer may consume more than one
> character in the lexbuf. When you parse UTF-8 with ocamllex, you have
> "waiting" states in the automaton corresponding to multibyte encoded code
> points.

	You're right. I've since downloaded wlex and given it
a whirl. Good stuff!


-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: features of PCRE-OCaml
  2000-12-09 13:12           ` Gerd Stolpmann
@ 2000-12-10  0:32             ` Markus Mottl
  0 siblings, 0 replies; 18+ messages in thread
From: Markus Mottl @ 2000-12-10  0:32 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Miles Egan, OCAML

Gerd Stolpmann schrieb am Saturday, den 09. December 2000:
> I would say it depends on the problem size. For example, when searching in a
> long text it is definitely worth-while to release the masterlock.
[snip]
> I think it is worth an experiment.

I have first given it a thought experiment and came to the conclusion
that it does not seem to be a good idea to have a default with possibly
superfluous overheads. If somebody really wants to exploit multi-processor
machines with OCaml, it is probably best to have additional functions
(or even better: a different module) that are tailored for parallel
computation.

This moves the responsibility to the programmer, who usually knows
best whether parallelism for specific uses is worth the overhead. If
somebody wants to try this out (placing those fancy master lock functions
around various code parts in the C-interface (pcre-OCaml/pcre_intf.c)
and measure the performance difference), just fetch the current sources
from the CVS-repository on Sourceforge:

  http://sourceforge.net/cvs/?group_id=15427

It should be very straightforward to add a "parallel" version, but the
tuning + experimentation may require some work. Maybe two submodules
in the "Pcre"-module would be nice: something like "Pcre.Normal" and
"Pcre.Parallel", both of which have the same interface and can even
share values so that one could choose at runtime which version should
execute a regular expression (or compile one, etc.).

I'd be happy to let you integrate a solution for parallel pattern
matching. I don't need this feature right now, but I'd be interested in
reports on executing regexps of PCRE-OCaml in parallel on a 32 processor
machine... ;)

- Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl



^ permalink raw reply	[flat|nested] 18+ messages in thread

* unicode support
  2000-12-07 16:01 ` John Max Skaller
  2000-12-07 16:32   ` Markus Mottl
@ 2000-12-14 17:35   ` Nickolay Semyonov
  1 sibling, 0 replies; 18+ messages in thread
From: Nickolay Semyonov @ 2000-12-14 17:35 UTC (permalink / raw)
  To: caml-list

John Max Skaller wrote:
> 
> Note also, Python 2.0 uses a modified library which does something
> PCRE-OCaml cannot: it works with Unicode characters (supposedly).
> 

Good point. What about native unicode support in Ocaml?

Nickolay



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2000-12-15 18:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-06  0:51 features of PCRE-OCaml Markus Mottl
2000-12-07 16:01 ` John Max Skaller
2000-12-07 16:32   ` Markus Mottl
2000-12-07 17:08     ` John Max Skaller
2000-12-08  0:03       ` Markus Mottl
2000-12-08 17:52         ` John Max Skaller
2000-12-08  9:19       ` Alain Frisch
2000-12-08 18:11         ` John Max Skaller
2000-12-08 19:48           ` Alain Frisch
2000-12-09 17:07             ` John Max Skaller
2000-12-14 17:35   ` unicode support Nickolay Semyonov
2000-12-07 20:17 ` features of PCRE-OCaml Miles Egan
2000-12-08 12:30   ` Gerd Stolpmann
2000-12-08 15:05     ` Markus Mottl
2000-12-08 15:40       ` Gerd Stolpmann
2000-12-09  3:03         ` Markus Mottl
2000-12-09 13:12           ` Gerd Stolpmann
2000-12-10  0:32             ` Markus Mottl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).