From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id TAA18879 for caml-red; Thu, 14 Dec 2000 19:20:01 +0100 (MET) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id CAA02043 for ; Wed, 13 Dec 2000 02:08:59 +0100 (MET) Received: from localhost ([63.124.128.66]) by nez-perce.inria.fr (8.11.1/8.10.0) with ESMTP id eBD18w515137 for ; Wed, 13 Dec 2000 02:08:58 +0100 (MET) Received: by localhost (Postfix, from userid 1000) id 5F00836D15; Tue, 12 Dec 2000 20:12:10 -0500 (EST) To: eijiro_sumii@anet.ne.jp Cc: checker@d6.com, caml-list@inria.fr, gerd@gerd-stolpmann.de, sumii@venus.is.s.u-tokyo.ac.jp Subject: Re: substring match like "strstr" References: <00120814135508.00625@ice> <20001210221657W.sumii@yl.is.s.u-tokyo.ac.jp> <4.3.2.7.2.20001211103237.00c12100@shell16.ba.best.com> <20001212122807K.sumii@yl.is.s.u-tokyo.ac.jp> From: John Prevost Date: 12 Dec 2000 20:12:09 -0500 In-Reply-To: <20001212122807K.sumii@yl.is.s.u-tokyo.ac.jp> Message-ID: <87ofyhw2dy.fsf@elbereth.pgh.arsdigita.com> User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/21.0.91 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: weis@pauillac.inria.fr >>>>> "es" == eijiro sumii writes: >> Any ideas why strstr blows the others away? What's the libc >> strstr look like? es> I have no idea, unfortunately... I just grabbed the glibc version, and found the following comment at the head of it: /* * My personal strstr() implementation that beats most other algorithms. * Until someone tells me otherwise, I assume that this is the * fastest implementation of strstr() in C. * I deliberately chose not to comment it. You should have at least * as much fun trying to understand it, as I had to write it :-). * * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */ Since the text of the function involves not only nasty gotos (particularly frightening is "goto shloop;" but also pointer tricks and register declarations, I'd say it's to be expected that it'll outperform O'Caml. I am, however, seeing if I can figure out the essential algorithm used (if there's really anything smart to it) and will let you know if I come up with anything extra special. The file strstr.c is included below for your edification. I'd argue that any Caml version is likely to a be a wee bit easier to understand. John. /* Return the offset of one string within another. Copyright (C) 1994, 1996, 1997 Free Software Foundation, Inc. This file is part of the GNU C Library. The GNU C Library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. The GNU C Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details. You should have received a copy of the GNU Library General Public License along with the GNU C Library; see the file COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ /* * My personal strstr() implementation that beats most other algorithms. * Until someone tells me otherwise, I assume that this is the * fastest implementation of strstr() in C. * I deliberately chose not to comment it. You should have at least * as much fun trying to understand it, as I had to write it :-). * * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */ #if HAVE_CONFIG_H # include #endif #if defined _LIBC || defined HAVE_STRING_H # include #endif typedef unsigned chartype; #undef strstr char * strstr (phaystack, pneedle) const char *phaystack; const char *pneedle; { register const unsigned char *haystack, *needle; register chartype b, c; haystack = (const unsigned char *) phaystack; needle = (const unsigned char *) pneedle; b = *needle; if (b != '\0') { haystack--; /* possible ANSI violation */ do { c = *++haystack; if (c == '\0') goto ret0; } while (c != b); c = *++needle; if (c == '\0') goto foundneedle; ++needle; goto jin; for (;;) { register chartype a; register const unsigned char *rhaystack, *rneedle; do { a = *++haystack; if (a == '\0') goto ret0; if (a == b) break; a = *++haystack; if (a == '\0') goto ret0; shloop: } while (a != b); jin: a = *++haystack; if (a == '\0') goto ret0; if (a != c) goto shloop; rhaystack = haystack-- + 1; rneedle = needle; a = *rneedle; if (*rhaystack == a) do { if (a == '\0') goto foundneedle; ++rhaystack; a = *++needle; if (*rhaystack != a) break; if (a == '\0') goto foundneedle; ++rhaystack; a = *++needle; } while (*rhaystack == a); needle = rneedle; /* took the register-poor approach */ if (a == '\0') break; } } foundneedle: return (char*) haystack; ret0: return 0; }