caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Re: substring match like "strstr"
@ 2000-12-14 21:12 Ruchira Datta
  0 siblings, 0 replies; 23+ messages in thread
From: Ruchira Datta @ 2000-12-14 21:12 UTC (permalink / raw)
  To: caml-list

John Prevost wrote:
>I just grabbed the glibc version, and found the following comment at
>the head of it:

>/*
> * My personal strstr() implementation that beats most other algorithms.
> * Until someone tells me otherwise, I assume that this is the
> * fastest implementation of strstr() in C.
> * I deliberately chose not to comment it.  You should have at least
> * as much fun trying to understand it, as I had to write it :-).
> *
> * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

>Since the text of the function involves not only nasty gotos
>(particularly frightening is "goto shloop;" but also pointer tricks
>and register declarations, I'd say it's to be expected that it'll
>outperform O'Caml.  I am, however, seeing if I can figure out the
>essential algorithm used (if there's really anything smart to it) and
>will let you know if I come up with anything extra special.

The frightening tricks are shocking to us functional programmers :-) but
once you get over your shock and read through it, you won't have any
problem understanding it.  At each point, don't ask yourself ``why is
this statement written in this obfuscated way?''  Just assume as an
article of faith that this incantation is invoked to propitiate the
Almighty Assembler :-).  Instead ask yourself ``what does this statement
do?''  That is fairly simple...

There is absolutely nothing smart about the algorithm; it is exactly
the brute-force one.  For example, the first character of the needle
is b, and the second character is c.  Suppose we have matched b and
c and then find that we don't match the whole needle.  Then we back
up our haystack pointer all the way back to the character after the
last potential match, i.e., to the character that matched c, and start 
from the beginning again by checking if it matches b.  We do this every 
time regardless of whether b and c are different.  I guess this is to 
save one register variable.

The speed comes from the low-level assembly language optimization; C is
being used as portable assembler.  I am sure it took a lot of ingenuity
and very intimate knowledge of the gcc compiler to do such a low-level
optimization while remaining in portable C.  But it is not something for
OCaml users to emulate.

Ruchira Datta
datta@math.berkeley.edu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-14  8:02           ` eijiro_sumii
@ 2000-12-14 21:53             ` Stephan Tolksdorf
  0 siblings, 0 replies; 23+ messages in thread
From: Stephan Tolksdorf @ 2000-12-14 21:53 UTC (permalink / raw)
  To: caml-list

Hello,

some time ago I've been interested in highly optimized text search
algorithms and wrote some in assembler.

A very good analysis of text search algorithms can be found in the
8/97 issue of the German magazin C't (www.heise.de/ct) - "Blitzfindig"
by Michael Tamm.
This article compares the algoritms "Brute Force" "Boyer-Moore",
"Quicksearch" and the advanced "T-Search" in detail.

The C source can be found on: ftp://ftp.heise.de/pub/ct/listings/ct9708.zip
Have fun...

MfG,
  Stephan Tolksdorf




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-14  6:48         ` Chris Hecker
@ 2000-12-14  8:02           ` eijiro_sumii
  2000-12-14 21:53             ` Stephan Tolksdorf
  0 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-14  8:02 UTC (permalink / raw)
  To: checker; +Cc: caml-list, gerd, sumii

> Is that C the simple C one I posted, or the libc one on sparc?

It's the one in libc on Solaris, so may be tuned by hand in assembly.

> Your new strstr is actually slower on x86 than your first one, but I
> haven't looked into why yet.

That's interesting.  I confirmed it myself on my x86 machine with my
program:

	strstr_imp2	88.530
	strstr_fun	124.210
	strstr_fun2	127.120
	glibc strstr	61.310

Seeing such strong machine dependency, it seems yet more reasonable to
use the "default" strstr provided by the system (unless the _pattern_
is long enough, in which case KMP or BM might perform better), even
though there is some overhead --- about 5% in my profiling on SPARC
--- of interfacing a C function to OCaml.  This is a natural (though
boring) conclusion, I think.

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-14  3:36       ` eijiro_sumii
@ 2000-12-14  6:48         ` Chris Hecker
  2000-12-14  8:02           ` eijiro_sumii
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Hecker @ 2000-12-14  6:48 UTC (permalink / raw)
  To: eijiro_sumii; +Cc: caml-list, gerd, sumii


>        strstr_imp2     90.57
>        strstr_fun2     89.64
>        C strstr        53.76

Is that C the simple C one I posted, or the libc one on sparc?

Here are my timings with yours added:

strstr_fun total = 1.135000 (5,0,19,13)
strstr_fun2 total = 1.470000 (5,0,19,13)
strstr_imp total = 1.030000 (5,0,19,13)
strstr_imp2 total = 0.965000 (5,0,19,13)
strstr_c total = 0.790000 (5,0,19,13)
strstr_libc total = 0.370000 (5,0,19,13)

Your new strstr is actually slower on x86 than your first one, but I haven't looked into why yet.  It may also be data (since my data is obviously lame garbage).  I should try Alpha Linux...

Chris




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-11 21:07     ` Chris Hecker
                         ` (2 preceding siblings ...)
  2000-12-12 10:07       ` Sven LUTHER
@ 2000-12-14  3:36       ` eijiro_sumii
  2000-12-14  6:48         ` Chris Hecker
  3 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-14  3:36 UTC (permalink / raw)
  To: checker; +Cc: caml-list, gerd, sumii

> Okay, I'm curious, so I'll port the code to caml and include it
> below as well (as practice for myself).  Can you try it in your test
> harness?

I tried the optimized version of the imperative strstr, along with an
optimized version of the functional strstr (attached at the end).  The
results (on the SPARC machine) were:

	strstr_imp2	90.57
	strstr_fun2	89.64
	C strstr	53.76

So the imperative version and the functional version were comparable,
though both were slower than the C version.  My guess is that they
actually compile into rather similar code, because all recursive calls
are in tail positions.

Anyway, for my application, the libc version seems the most reasonable
choice, as far as I use a strstr-like function at all...

// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
// 
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania

let strstr_fun2 pat str =
  let patlim = String.length pat - 1 in
  let strlim = String.length str - 1 in
  let rec sub patpos strpos =
    (* compare pat[0..patpos] and str[(strpos-patpos)..strpos] *)
    if patpos < 0 then true else
    if pat.[patpos] <> str.[strpos] then false else
    sub (patpos - 1) (strpos - 1) in
  let rec search strpos =
    (* compare pat[0..patlim] and str[(strpos-patlim)..strpos] *)
    if strpos > strlim then raise Not_found else
    if sub patlim strpos then strpos - patlim else
    search (strpos + 1) in
  search patlim
;;



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-13 10:53               ` Julian Assange
@ 2000-12-13 13:28                 ` Eijiro Sumii
  0 siblings, 0 replies; 23+ messages in thread
From: Eijiro Sumii @ 2000-12-13 13:28 UTC (permalink / raw)
  To: proff; +Cc: caml-list, sumii

Julian Assange <proff@iq.org> wrote:
> All of these more sophisticated methods are only going to be a win
> when you are searching for larger strings. strstr is brutally simple,
> has little setup/cleanup time, and hence is hard to beat for short
> strings.

That is completely correct, and the substrings in my application are
indeed short (3 characters), but:

eijiro_sumii@anet.ne.jp wrote:
| Since my program calls the function more than 3000 times for _each_
| pattern, I did partial application, of course.  Even so, kmp in OCaml
| still performed much worse than strstr in libc (both Sun's and GNU's),
| at least in this program.

I expected that 3000 strings for 1 substring, each of which are about
20 characters long as I mentioned before, are enough for paying the
"setup/cleanup" overhead.  (On the other hand, strstr in libc doesn't
benefit from the partial application, of course.)

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-13 10:02             ` eijiro_sumii
  2000-12-13 10:17               ` Eijiro Sumii
@ 2000-12-13 10:53               ` Julian Assange
  2000-12-13 13:28                 ` Eijiro Sumii
  1 sibling, 1 reply; 23+ messages in thread
From: Julian Assange @ 2000-12-13 10:53 UTC (permalink / raw)
  To: eijiro_sumii; +Cc: caml-list, Jean-Christophe.Filliatre, jmp, sumii, proff

eijiro_sumii@anet.ne.jp writes:

> Hi everyone,
> 
> Thank you very much for a lot of kind advice!
> 
> Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote:
> > I   missed  the   beginning  of   the  discussion,   but  implementing
> > Knuth-Morris-Pratt is quite easy. The  code is given below (26 lines),
> 
> Actually, I found the (almost) same code on the web and tried it.  The
> results (again, execution time in seconds) were:
> 
> 		SPARC	Pentium
> 	strstr	52.68	57.050
> 	kmp	111.52	143.490

All of these more sophisticated methods are only going to be a win
when you are searching for larger strings. strstr is brutally simple,
has little setup/cleanup time, and hence is hard to beat for short
strings. Additionally gcc, will replace various s* calls with in-line
assembly versions. I'm not sure if this includes strstr (it is a
little more complicated than most).

Cheers,
Julian.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-13 10:02             ` eijiro_sumii
@ 2000-12-13 10:17               ` Eijiro Sumii
  2000-12-13 10:53               ` Julian Assange
  1 sibling, 0 replies; 23+ messages in thread
From: Eijiro Sumii @ 2000-12-13 10:17 UTC (permalink / raw)
  To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii

> Actually, I found the (almost) same code on the web and tried it.  The
> results (again, execution time in seconds) were:
> 
> 		SPARC	Pentium
> 	strstr	52.68	57.050
> 	kmp	111.52	143.490

Oops, I forgot to give "-unsafe" to ocamlopt in kmp.  With -unsafe,
the results have improved (but still worse than strstr):

		SPARC	Pentium
	kmp	73.50	79.940

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-12 12:28           ` Jean-Christophe Filliatre
@ 2000-12-13 10:02             ` eijiro_sumii
  2000-12-13 10:17               ` Eijiro Sumii
  2000-12-13 10:53               ` Julian Assange
  0 siblings, 2 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-13 10:02 UTC (permalink / raw)
  To: caml-list; +Cc: Jean-Christophe.Filliatre, proff, jmp, sumii

Hi everyone,

Thank you very much for a lot of kind advice!

Jean-Christophe Filliatre <Jean-Christophe.Filliatre@lri.fr> wrote:
> I   missed  the   beginning  of   the  discussion,   but  implementing
> Knuth-Morris-Pratt is quite easy. The  code is given below (26 lines),

Actually, I found the (almost) same code on the web and tried it.  The
results (again, execution time in seconds) were:

		SPARC	Pentium
	strstr	52.68	57.050
	kmp	111.52	143.490

The SPARC machine is the same as before.  The Pentium machine is
running Linux 2.2.17 inside VMWare 2.0.3 (with 48MB virtual main
memory) on Windows 2000 with a Pentium II 400MHz processor and 128MB
main memory.

Since my program calls the function more than 3000 times for _each_
pattern, I did partial application, of course.  Even so, kmp in OCaml
still performed much worse than strstr in libc (both Sun's and GNU's),
at least in this program.

If you'd like to check it yourself, the source code is available at
the same URL as before
(http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz).

Julian Assange <proff@iq.org> wrote:
> I'm interested in this library. Can you tell me a little more? What's
> your ETA?

It is explained a little at:

  http://www.hypothesiscreator.net/

It has been developed in C++, but we began to port it to OCaml for
clarity and convenience (and fun:-).

Julian Assange <proff@iq.org> wrote:
> If you are really calling strstr 400,000,000 times, I strongly suggest
> the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If

Can Boyer-Moore (in OCaml) be _much_ faster than KMP, so that it beats
strstr in libc?

John Prevost <jmp@arsdigita.com> wrote:
> I just grabbed the glibc version, and found the following comment at
> the head of it:

That was exactly what I found too (and mentioned a bit in my previous
message).  Maybe I should also disassemble and check Sun's strstr...?

Regards,

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-13  1:12         ` John Prevost
@ 2000-12-13  2:35           ` Chris Hecker
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Hecker @ 2000-12-13  2:35 UTC (permalink / raw)
  To: John Prevost, eijiro_sumii; +Cc: caml-list, gerd, sumii


>The file strstr.c is included below for your edification.  I'd argue
>that any Caml version is likely to a be a wee bit easier to
>understand.

Okay, so here's a new challenge:  write the most readable strstr you can in ocaml.  The restriction is you have to actually implement the algorithm, not just call the regex library.  :)

What is the most functional and readable one we can come up with?  And then we'll compare its performance and see if the readable/performance curve is appropriate.  Eijiro's was relatively short and functional, but still not completely transparent.  My imperative one sucks for readability because of all the references and variables.

Chris




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-12  3:28       ` eijiro_sumii
@ 2000-12-13  1:12         ` John Prevost
  2000-12-13  2:35           ` Chris Hecker
  0 siblings, 1 reply; 23+ messages in thread
From: John Prevost @ 2000-12-13  1:12 UTC (permalink / raw)
  To: eijiro_sumii; +Cc: checker, caml-list, gerd, sumii

>>>>> "es" == eijiro sumii <eijiro_sumii@anet.ne.jp> writes:

    >> Any ideas why strstr blows the others away?  What's the libc
    >> strstr look like?

    es> I have no idea, unfortunately...

I just grabbed the glibc version, and found the following comment at
the head of it:

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

Since the text of the function involves not only nasty gotos
(particularly frightening is "goto shloop;" but also pointer tricks
and register declarations, I'd say it's to be expected that it'll
outperform O'Caml.  I am, however, seeing if I can figure out the
essential algorithm used (if there's really anything smart to it) and
will let you know if I come up with anything extra special.

The file strstr.c is included below for your edification.  I'd argue
that any Caml version is likely to a be a wee bit easier to
understand.

John.


/* Return the offset of one string within another.
   Copyright (C) 1994, 1996, 1997 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Library General Public License as
   published by the Free Software Foundation; either version 2 of the
   License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Library General Public License for more details.

   You should have received a copy of the GNU Library General Public
   License along with the GNU C Library; see the file COPYING.LIB.  If not,
   write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
   Boston, MA 02111-1307, USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de	*/

#if HAVE_CONFIG_H
# include <config.h>
#endif

#if defined _LIBC || defined HAVE_STRING_H
# include <string.h>
#endif

typedef unsigned chartype;

#undef strstr

char *
strstr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = *needle;
  if (b != '\0')
    {
      haystack--;				/* possible ANSI violation */
      do
	{
	  c = *++haystack;
	  if (c == '\0')
	    goto ret0;
	}
      while (c != b);

      c = *++needle;
      if (c == '\0')
	goto foundneedle;
      ++needle;
      goto jin;

      for (;;)
        {
          register chartype a;
	  register const unsigned char *rhaystack, *rneedle;

	  do
	    {
	      a = *++haystack;
	      if (a == '\0')
		goto ret0;
	      if (a == b)
		break;
	      a = *++haystack;
	      if (a == '\0')
		goto ret0;
shloop:	    }
          while (a != b);

jin:	  a = *++haystack;
	  if (a == '\0')
	    goto ret0;

	  if (a != c)
	    goto shloop;

	  rhaystack = haystack-- + 1;
	  rneedle = needle;
	  a = *rneedle;

	  if (*rhaystack == a)
	    do
	      {
		if (a == '\0')
		  goto foundneedle;
		++rhaystack;
		a = *++needle;
		if (*rhaystack != a)
		  break;
		if (a == '\0')
		  goto foundneedle;
		++rhaystack;
		a = *++needle;
	      }
	    while (*rhaystack == a);

	  needle = rneedle;		/* took the register-poor approach */

	  if (a == '\0')
	    break;
        }
    }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-10 15:39     ` Gerd Stolpmann
  2000-12-11  3:57       ` eijiro_sumii
@ 2000-12-12 13:58       ` Julian Assange
  1 sibling, 0 replies; 23+ messages in thread
From: Julian Assange @ 2000-12-12 13:58 UTC (permalink / raw)
  To: gerd; +Cc: eijiro_sumii, caml-list, sumii, proff

Gerd Stolpmann <gerd@gerd-stolpmann.de> writes:

> >The results (execution time in seconds) were as follows.
> >
> >  strstr   55.74
> >  regexp  154.37
> >  OCamlA  302.57
> >  OCamlB  129.23
> 
> I did not expect that OCamlB performs so well; so I suggested OcamlA and
> regexp (which are both fast for the bytecode interpreter, too). I think it
> depends also on the problem size (especially on the length of the substring).
> 
> Gerd

If you are really calling strstr 400,000,000 times, I strongly suggest
the use of Boyer-Moore. Is your alphabet amino-acids or base-pairs? If
the latter, I have developed parallel hashing version of Boyer-More
(lets call it Assange-Boyer-More :), which beats the pants off
Knuth-Moris-Pratt. 500 long substrings searches are only about 50%
slower than 1, although for this application, if you have the memory,
you might also want to look at suffix trees or suffix arrays.

--
 Julian Assange        |If you want to build a ship, don't drum up people
                       |together to collect wood or assign them tasks
 proff@iq.org          |and work, but rather teach them to long for the endless
 proff@gnu.ai.mit.edu  |immensity of the sea. -- Antoine de Saint Exupery



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-12  5:06         ` Chris Hecker
@ 2000-12-12 12:28           ` Jean-Christophe Filliatre
  2000-12-13 10:02             ` eijiro_sumii
  0 siblings, 1 reply; 23+ messages in thread
From: Jean-Christophe Filliatre @ 2000-12-12 12:28 UTC (permalink / raw)
  To: Chris Hecker; +Cc: gerd, eijiro_sumii, caml-list, sumii


Hello,

I   missed  the   beginning  of   the  discussion,   but  implementing
Knuth-Morris-Pratt is quite easy. The  code is given below (26 lines),
following Sedgewick's book "Algorithms".  Notice that the function kmp
can be partially applied to  a pattern, therefore computing the offset
table only once. The complexity is N+M where N is the size of the text
and M the size of the pattern.

(The following program was proved  correct in the Coq Proof Assistant;
without the Not_found exception, however)

Hope this helps,
-- 
Jean-Christophe FILLIATRE
  mailto:Jean-Christophe.Filliatre@lri.fr
  http://www.lri.fr/~filliatr

======================================================================
let initnext p =
  let m = String.length p in
  let next = Array.create m 0 in
  let i = ref 1 and j = ref 0 in
  while !i < m-1 do
    if p.[!i] = p.[!j] then begin
      incr i; incr j; next.(!i) <- !j
    end else
      if !j = 0 then begin incr i; next.(!i) <- 0 end else j := next.(!j)
  done;
  next

let kmp p =
  let next = initnext p in
  fun a ->
    let n = String.length a
    and m = String.length p
    and i = ref 0
    and j = ref 0 in
    while !j < m & !i < n do
      if a.[!i] = p.[!j] then begin
	incr i; incr j
      end else
	if !j = 0 then incr i else j := next.(!j)
    done;
    if !j >= m then !i-m else raise Not_found
======================================================================



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-11 21:07     ` Chris Hecker
  2000-12-11 22:22       ` Gerd Stolpmann
  2000-12-12  3:28       ` eijiro_sumii
@ 2000-12-12 10:07       ` Sven LUTHER
  2000-12-14  3:36       ` eijiro_sumii
  3 siblings, 0 replies; 23+ messages in thread
From: Sven LUTHER @ 2000-12-12 10:07 UTC (permalink / raw)
  To: Chris Hecker; +Cc: eijiro_sumii, caml-list, gerd, sumii

On Mon, Dec 11, 2000 at 01:07:22PM -0800, Chris Hecker wrote:
> 
> >The results (execution time in seconds) were as follows.
> >  strstr   55.74
> >  regexp  154.37
> >  OCamlA  302.57
> >  OCamlB  129.23
> 
> Any ideas why strstr blows the others away?  What's the libc strstr look like?  I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything.  This is exactly the kind of problem on which I'd expect caml to come within 10% of c.

Is this kind of thing not one of the field which can take advantage of the
vector units included in the processor, like mmx, altivec and the equivalent
in alpha or sparc processors ? Not sure though, it would depend on the actual
implementation used and command set.

Friendly,

Sven Luther



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-11 22:22       ` Gerd Stolpmann
@ 2000-12-12  5:06         ` Chris Hecker
  2000-12-12 12:28           ` Jean-Christophe Filliatre
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Hecker @ 2000-12-12  5:06 UTC (permalink / raw)
  To: gerd, eijiro_sumii, caml-list; +Cc: gerd, sumii


>  let i_limit = strlen - patlen in

Ah, duh.  Good idea.

Here are some new timings, including the _imp2 (I fixed a minor fencepost bug in yours in an admittedly cheesy way; the code's below):

Under Win2k cmd.exe:
strstr total = 0.610000 (5,0,19,13)
strstr_imp total = 0.593000 (5,0,19,13)
strstr_imp2 total = 0.563000 (5,0,19,13)
strstr_c total = 0.469000 (5,0,19,13)

Under Win98 command.com:
strstr total = 1.460000 (5,0,19,13)
strstr_imp total = 1.020000 (5,0,19,13)
strstr_imp2 total = 0.955000 (5,0,19,13)
strstr_c total = 0.800000 (5,0,19,13)

Under Win98 emacs shell-command:
strstr total = 1.125000 (5,0,19,13)
strstr_imp total = 1.025000 (5,0,19,13)
strstr_imp2 total = 0.960000 (5,0,19,13)
strstr_c total = 0.800000 (5,0,19,13)

I doubled the number of iterations to try to figure out the emacs/command.com difference, but to no avail.

However, then I decided to check to see if this was actually the strstr in the library, so instead of calling that c code, I called string.h's strstr.  I discovered that there's actually an intel-asm version of strstr that's compiled into the crt:

strstr total = 1.130000 (5,0,19,13)
strstr_imp total = 1.020000 (5,0,19,13)
strstr_imp2 total = 0.960000 (5,0,19,13)
strstr_c total = 0.365000 (5,0,19,13)

Ouch!  We're back to 3x slower.  But hey, no one said ocaml could compete with asm.  :)

Anyway, as it stands, the ocaml code is 17% slower than the C code at relatively comparable code-simplicity.  I'd say the ocaml code is a bit harder to understand and wordier, but that's mostly because ocaml doesn't have modifyable pointers, and the C code just walks the pointers through the array.  Also, the C has an early-out return that ocaml can't do (without an exception), and that makes things a little simpler as well.

Of course, I'm not saying you can extrapolate anything about real programs from this simple example, but it's interesting to me nonetheless.

Chris

Here's the updated strstr_imp2.  If you really wanted to try to max out the ocaml, we could read the pattern into an int and try to match it a word at a time, etc.

(* "optimized" imperative strstr *)
let strstr_imp2 pat str = 
  let strlen = String.length str in
  let patlen = String.length pat and
      i = ref 0 and
      f = ref strlen in
  let limit = strlen - patlen + 1 in
  while !i < limit do
    let pati = ref 0 and
        j = ref !i in
    while !pati < patlen && 
      String.unsafe_get str !j == String.unsafe_get pat !pati do
      pati := succ !pati;
      j := succ !j;
    done;
    if !pati = patlen then
      begin f := !i; i := strlen end
    else
      i := succ !i
  done;
  !f



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-11 21:07     ` Chris Hecker
  2000-12-11 22:22       ` Gerd Stolpmann
@ 2000-12-12  3:28       ` eijiro_sumii
  2000-12-13  1:12         ` John Prevost
  2000-12-12 10:07       ` Sven LUTHER
  2000-12-14  3:36       ` eijiro_sumii
  3 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-12  3:28 UTC (permalink / raw)
  To: checker; +Cc: caml-list, gerd, sumii

> Any ideas why strstr blows the others away?  What's the libc strstr
> look like?

I have no idea, unfortunately...

> I just looked in the MSVC source and it's a braindead while loop
> (copied below), so it's not like it's doing a fancy Boyer-Moore or
> anything.

I don't know anything about strstr in Sun's strstr, but I checked
strstr in GNU libc.  It is a quite complicated program, but look like
a brute-force algorithm (that is, no Knuth-Morris-Pratt or anything
like that).

> This is exactly the kind of problem on which I'd expect caml to come
> within 10% of c.

That was what I expected, too.

> I'd say I'd do the tests myself, but I don't have a bunch of gene
> sequences laying around.  :)

I've put the current version of my application program at:

  http://www.yl.is.s.u-tokyo.ac.jp/~sumii/tmp/hc.tar.gz

If you like, you're welcome to check it yourself, of course.:) You can
run it as "make ; time ./hc".  The output should look like:

  score = 0.348672
  score = 0.391356
  (snip)
  score = 0.630415

The OCaml function "strstr" is in the file "strstr.ml".

> Okay, I'm curious, so I'll port the code to caml and include it
> below as well (as practice for myself).  Can you try it in your test
> harness?

Sure, but please give me a little time.  This is a kind of part-time,
weekend job for me, and I can't tell when I have time to do it next...

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-11 21:07     ` Chris Hecker
@ 2000-12-11 22:22       ` Gerd Stolpmann
  2000-12-12  5:06         ` Chris Hecker
  2000-12-12  3:28       ` eijiro_sumii
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-11 22:22 UTC (permalink / raw)
  To: Chris Hecker, eijiro_sumii, caml-list; +Cc: gerd, sumii

On Mon, 11 Dec 2000, Chris Hecker wrote:
>Any ideas why strstr blows the others away?  

It's the type of brain-dead loop for which C compilers are written.

>Have you tried a purely imperative version?  I'd say I'd do the tests myself,
>but I don't have a bunch of gene sequences laying around.  :)   

It's not always the imperative version which is better. The difference in the
generated code of a functional and an imperative version is often minimal.

>Looking at the asm output from strstr_imp, there's not that much you could do
>to make it better.  Maybe there are a few peephole things, and there are
>probably 30% more instructions than you need in this specific case because the
>code to compare characters goes to the trouble of keeping the ints in the caml
>format even in registers inside the loop (so they're shifted back and forth),
>and it converts the chars from the unsafe_gets to full caml ints, which is
>useless since they just get compared to each other and then dumped.  It might
>be interesting to write a peephole optimizer aimed at peepholing camlopt code,
>and looking for things like this. 
> 
>Can anybody optimize the caml src for strstr_imp? 

let strstr_imp2 pat str =
  let strlen = String.length str in
  let patlen = String.length pat and
    i = ref 0 and
    f = ref strlen in
  let i_limit = strlen - patlen in
  while !i < i_limit do
    let pati = ref 0 and
      j = ref !i in
    while !pati < patlen &&
      String.unsafe_get str !j == String.unsafe_get pat !pati do
	pati := succ !pati;
	j := succ !j;
    done;
    if !pati = patlen then
      begin f := !i; i := strlen end
    else
      i := succ !i
  done;
  !f
;;

Perhaps !pati < patlen can also be removed by introducing a "guard" (modifying
the last character of pat such that the loop runs always into an inequality).
However, this would require extra code managing the temporary modification of
pat, and I guess it is not worth-while for short pats.

Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-10 13:16   ` eijiro_sumii
  2000-12-10 15:39     ` Gerd Stolpmann
@ 2000-12-11 21:07     ` Chris Hecker
  2000-12-11 22:22       ` Gerd Stolpmann
                         ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Chris Hecker @ 2000-12-11 21:07 UTC (permalink / raw)
  To: eijiro_sumii, caml-list; +Cc: gerd, sumii


>The results (execution time in seconds) were as follows.
>  strstr   55.74
>  regexp  154.37
>  OCamlA  302.57
>  OCamlB  129.23

Any ideas why strstr blows the others away?  What's the libc strstr look like?  I just looked in the MSVC source and it's a braindead while loop (copied below), so it's not like it's doing a fancy Boyer-Moore or anything.  This is exactly the kind of problem on which I'd expect caml to come within 10% of c.

Have you tried a purely imperative version?  I'd say I'd do the tests myself, but I don't have a bunch of gene sequences laying around.  :)  

Okay, I'm curious, so I'll port the code to caml and include it below as well (as practice for myself).  Can you try it in your test harness?  I've attached a cheesy test app with your caml, calling msvc's strstr, and my imperative version.  I make no claims of profiling accuracy or even code correctness, but here are the results when run from a DOS box on Win98:

strstr total = 1.060000 (5,0,19,13)
strstr_imp total = 0.510000 (5,0,19,13)
strstr_c total = 0.415000 (5,0,19,13)

(The weird thing is when I ran it with shell-command under emacs the timings were 

strstr total = 0.575000 (5,0,19,13)
strstr_imp total = 0.500000 (5,0,19,13)
strstr_c total = 0.415000 (5,0,19,13)

which I found bizarre since strstr became 2x faster, but I haven't had time to look into it.  Can somebody else try these tests? )

Looking at the asm output from strstr_imp, there's not that much you could do to make it better.  Maybe there are a few peephole things, and there are probably 30% more instructions than you need in this specific case because the code to compare characters goes to the trouble of keeping the ints in the caml format even in registers inside the loop (so they're shifted back and forth), and it converts the chars from the unsafe_gets to full caml ints, which is useless since they just get compared to each other and then dumped.  It might be interesting to write a peephole optimizer aimed at peepholing camlopt code, and looking for things like this.

Can anybody optimize the caml src for strstr_imp?

Chris


----strstrc.c-----


#include "/test/ocaml/lib/caml/mlvalues.h"

/* ripped from msvc crt src */
char * __cdecl strstr (
        const char * str1,
        const char * str2
        )
{
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp)
        {
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && !(*s1-*s2) )
                        s1++, s2++;

                if (!*s2)
                        return(cp);

                cp++;
        }

        return(0);

}

value Camlstrstr( value pat, value str )
{
    char *string = String_val(str);
    return Val_int(strstr(string,String_val(pat))-string);
}

------strstrc.mli-----


external strstr_c: string -> string -> int = "Camlstrstr"

------strstr.ml-------




(* Eijiro Sumii's functional strstr *)
let strstr pat str =
  let patlen = String.length pat in
  let strlen = String.length str in
  let rec is_prefix pos spos epos =
    if pos < 0 then true else
    if String.unsafe_get pat pos <>
      String.unsafe_get str epos then false else
      is_prefix (pos - 1) spos (epos - 1) in
  let rec search spos =
    let epos = spos + patlen - 1 in
    if epos >= strlen then raise Not_found else
    if is_prefix (patlen - 1) spos epos then spos else
    search (spos + 1) in
  search 0
    
(* checker's lame imperative strstr *)
let strstr_imp pat str = 
  let strlen = String.length str in
  let patlen = String.length pat and
      i = ref 0 and
      f = ref strlen in
  while !i < strlen do
    let pati = ref 0 and
        j = ref !i in
    while !pati < patlen && !j < strlen &&
      String.unsafe_get str !j == String.unsafe_get pat !pati do
        pati := succ !pati;
        j := succ !j;
    done;
    if !pati = patlen then
      begin f := !i; i := strlen end
    else
      i := succ !i
  done;
  !f

(* a really questionable timing harness *)
let time_strstr name strstr =
  let t0 = Unix.gettimeofday () in 
  for i = 0 to 100000 do
    strstr "this" "that this the other";
    strstr "this" "this that this the other";
    strstr "this" "that tis the other this";
    strstr "this" "that the othethisr th";
  done;
  let t = Unix.gettimeofday () -. t0 in
  Printf.printf "%s total = %f (%d,%d,%d,%d)\n" name t
  (* a really pathetic regression test *)
    (strstr "this" "that this the other")
    (strstr "this" "this that this the other")
    (strstr "this" "that tis the other this")
    (strstr "this" "that the othethisr th")

(* main *)
let _ = 
  time_strstr "strstr" strstr;
  time_strstr "strstr_imp" strstr_imp;
  time_strstr "strstr_c" Strstrc.strstr_c;



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-10 15:39     ` Gerd Stolpmann
@ 2000-12-11  3:57       ` eijiro_sumii
  2000-12-12 13:58       ` Julian Assange
  1 sibling, 0 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-11  3:57 UTC (permalink / raw)
  To: caml-list; +Cc: gerd, sumii

> I did not expect that OCamlB performs so well; so I suggested OcamlA and
> regexp (which are both fast for the bytecode interpreter, too). I think it
> depends also on the problem size (especially on the length of the substring).

Right, I forgot to mention those -- in the benchmark, the "needles"
(matched substrings) were 3 characters long and the "haystacks"
(searched strings) were about 20 characters long; I used the native
code compiler.  For much larger input, regexp could possibly do
better.  Anyway, thank you for the helpful suggestions.

// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
// 
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-10 13:16   ` eijiro_sumii
@ 2000-12-10 15:39     ` Gerd Stolpmann
  2000-12-11  3:57       ` eijiro_sumii
  2000-12-12 13:58       ` Julian Assange
  2000-12-11 21:07     ` Chris Hecker
  1 sibling, 2 replies; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-10 15:39 UTC (permalink / raw)
  To: eijiro_sumii, caml-list; +Cc: gerd, sumii

On Sun, 10 Dec 2000, eijiro_sumii@anet.ne.jp wrote:
>OK, I benchmarked the following 4 implementations for the purpose of
>comparison.
>
>  strstr
>    stub to call "strstr" in libc
>
>  regexp
>    combination of "Str.regexp_string" and "Str.search_forward"
>
>  OCamlA
>    the simple implementation in OCaml given in the previous message,
>    a little tuned for more efficiency
>
>  OCamlB
>    another straightforward implementation in OCaml, attached at the
>    end of this message
>
>The results (execution time in seconds) were as follows.
>
>  strstr   55.74
>  regexp  154.37
>  OCamlA  302.57
>  OCamlB  129.23

I did not expect that OCamlB performs so well; so I suggested OcamlA and
regexp (which are both fast for the bytecode interpreter, too). I think it
depends also on the problem size (especially on the length of the substring).

Gerd
-- 
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-08 12:57 ` Gerd Stolpmann
@ 2000-12-10 13:16   ` eijiro_sumii
  2000-12-10 15:39     ` Gerd Stolpmann
  2000-12-11 21:07     ` Chris Hecker
  0 siblings, 2 replies; 23+ messages in thread
From: eijiro_sumii @ 2000-12-10 13:16 UTC (permalink / raw)
  To: caml-list; +Cc: gerd, sumii

> You can write it yourself:
...
> If speed is important, I recommend 
> 
> let find_substring s1 s2 =
>   Str.search_forward (Str.regexp (Str.quote s2)) s1 0
> ;;

OK, I benchmarked the following 4 implementations for the purpose of
comparison.

  strstr
    stub to call "strstr" in libc

  regexp
    combination of "Str.regexp_string" and "Str.search_forward"

  OCamlA
    the simple implementation in OCaml given in the previous message,
    a little tuned for more efficiency

  OCamlB
    another straightforward implementation in OCaml, attached at the
    end of this message

The results (execution time in seconds) were as follows.

  strstr   55.74
  regexp  154.37
  OCamlA  302.57
  OCamlB  129.23

The application is a kind of library for data mining of gene
sequences.  It calls the substring matching function more than
400,000,000 times.  The machine is a Sun UltraEnterprise 450 with 4
UltraSPARC II (480 MHz) processors and 2GB main memory, but the
program itself doesn't make any use of multiple processors or much
space.

Although I haven't checked the reasons for these results in detail,
this might hopefully be of some information for other people, too.

And probably I should also consider implementing more sophisticated,
efficient algorithm in OCaml...

// Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/)
// 
// Ph.D. Student at Department of IS, University of Tokyo
// Visiting Scholar at Department of CIS, University of Pennsylvania

----------------------------------------------------------------------
let strstr pat str =
  let patlen = String.length pat in
  let strlen = String.length str in
  let rec is_prefix pos spos epos =
    if pos < 0 then true else
    if String.unsafe_get pat pos <>
      String.unsafe_get str epos then false else
    is_prefix (pos - 1) spos (epos - 1) in
  let rec search spos =
    let epos = spos + patlen - 1 in
    if epos >= strlen then raise Not_found else
    if is_prefix (patlen - 1) spos epos then spos else
    search (spos + 1) in
  search 0
----------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: substring match like "strstr"
  2000-12-08  6:04 eijiro_sumii
@ 2000-12-08 12:57 ` Gerd Stolpmann
  2000-12-10 13:16   ` eijiro_sumii
  0 siblings, 1 reply; 23+ messages in thread
From: Gerd Stolpmann @ 2000-12-08 12:57 UTC (permalink / raw)
  To: eijiro_sumii, caml-list; +Cc: sumii

On Fri, 08 Dec 2000, eijiro_sumii@anet.ne.jp wrote:
>Hi all,
>
>Is there a substring matching function (like "strstr" in libc) in the
>standard OCaml library?  Of course there are many ways to implement it
>(by writing it from scratch, using the Str library, interfacing
>"strstr" in libc, etc.), but they are overkill for my purpose.  So I'm
>wondering whether such a function already exists, but I couldn't find
>it in the manual...

There isn't such a function.

You can write it yourself:

let find_substring s1 s2 =
  (* find s2 in s1, or raise Not_found *)
  if String.length s2 > String.length s1 then raise Not_found;
  let b = String.create (String.length s2) in
  let rec search k =
    if k > String.length s1 - String.length s2 then raise Not_found;
    String.blit ~src:s1 ~src_pos:k ~dst:b ~dst_pos:0 ~len:(String.length s2);
    if b = s2 then
      k
    else search (k+1)
  in
  search 0
;;

This version of find_substring avoids superflous heap allocations, and I think
it is tricky anough to be included into the stdlib. (However, if we had strstr
as primitive, even the allocation of b could be avoided.)

If speed is important, I recommend 

let find_substring s1 s2 =
  Str.search_forward (Str.regexp (Str.quote s2)) s1 0
;;

I hope these solutions aren't overkilling.

Gerd
-- 
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   gerd@gerd-stolpmann.de
Germany                     
----------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 23+ messages in thread

* substring match like "strstr"
@ 2000-12-08  6:04 eijiro_sumii
  2000-12-08 12:57 ` Gerd Stolpmann
  0 siblings, 1 reply; 23+ messages in thread
From: eijiro_sumii @ 2000-12-08  6:04 UTC (permalink / raw)
  To: caml-list; +Cc: sumii

Hi all,

Is there a substring matching function (like "strstr" in libc) in the
standard OCaml library?  Of course there are many ways to implement it
(by writing it from scratch, using the Str library, interfacing
"strstr" in libc, etc.), but they are overkill for my purpose.  So I'm
wondering whether such a function already exists, but I couldn't find
it in the manual...

Eijiro



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2000-12-15 12:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-14 21:12 substring match like "strstr" Ruchira Datta
  -- strict thread matches above, loose matches on Subject: below --
2000-12-08  6:04 eijiro_sumii
2000-12-08 12:57 ` Gerd Stolpmann
2000-12-10 13:16   ` eijiro_sumii
2000-12-10 15:39     ` Gerd Stolpmann
2000-12-11  3:57       ` eijiro_sumii
2000-12-12 13:58       ` Julian Assange
2000-12-11 21:07     ` Chris Hecker
2000-12-11 22:22       ` Gerd Stolpmann
2000-12-12  5:06         ` Chris Hecker
2000-12-12 12:28           ` Jean-Christophe Filliatre
2000-12-13 10:02             ` eijiro_sumii
2000-12-13 10:17               ` Eijiro Sumii
2000-12-13 10:53               ` Julian Assange
2000-12-13 13:28                 ` Eijiro Sumii
2000-12-12  3:28       ` eijiro_sumii
2000-12-13  1:12         ` John Prevost
2000-12-13  2:35           ` Chris Hecker
2000-12-12 10:07       ` Sven LUTHER
2000-12-14  3:36       ` eijiro_sumii
2000-12-14  6:48         ` Chris Hecker
2000-12-14  8:02           ` eijiro_sumii
2000-12-14 21:53             ` Stephan Tolksdorf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).