From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id SAA06761 for caml-red; Mon, 11 Dec 2000 18:36:30 +0100 (MET) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id OAA03605 for ; Sun, 10 Dec 2000 14:17:07 +0100 (MET) Received: from dynabook (h128-133.tokyu-net.catv.ne.jp [210.149.128.133]) by nez-perce.inria.fr (8.11.1/8.10.0) with ESMTP id eBADH4T22162 for ; Sun, 10 Dec 2000 14:17:06 +0100 (MET) Received: from dynabook ([127.0.0.1] helo=localhost ident=sumii) by dynabook with esmtp (Exim 3.12 #1 (Debian)) id 1456La-0003zF-00; Sun, 10 Dec 2000 22:16:58 +0900 To: caml-list@inria.fr From: eijiro_sumii@anet.ne.jp Cc: gerd@gerd-stolpmann.de, sumii@venus.is.s.u-tokyo.ac.jp Subject: Re: substring match like "strstr" In-Reply-To: <00120814135508.00625@ice> References: <20001208150443I.sumii@yl.is.s.u-tokyo.ac.jp> <00120814135508.00625@ice> X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20001210221657W.sumii@yl.is.s.u-tokyo.ac.jp> Date: Sun, 10 Dec 2000 22:16:57 +0900 X-Dispatcher: imput version 991025(IM133) Sender: weis@pauillac.inria.fr > You can write it yourself: ... > If speed is important, I recommend > > let find_substring s1 s2 = > Str.search_forward (Str.regexp (Str.quote s2)) s1 0 > ;; OK, I benchmarked the following 4 implementations for the purpose of comparison. strstr stub to call "strstr" in libc regexp combination of "Str.regexp_string" and "Str.search_forward" OCamlA the simple implementation in OCaml given in the previous message, a little tuned for more efficiency OCamlB another straightforward implementation in OCaml, attached at the end of this message The results (execution time in seconds) were as follows. strstr 55.74 regexp 154.37 OCamlA 302.57 OCamlB 129.23 The application is a kind of library for data mining of gene sequences. It calls the substring matching function more than 400,000,000 times. The machine is a Sun UltraEnterprise 450 with 4 UltraSPARC II (480 MHz) processors and 2GB main memory, but the program itself doesn't make any use of multiple processors or much space. Although I haven't checked the reasons for these results in detail, this might hopefully be of some information for other people, too. And probably I should also consider implementing more sophisticated, efficient algorithm in OCaml... // Eijiro Sumii (http://www.yl.is.s.u-tokyo.ac.jp/~sumii/) // // Ph.D. Student at Department of IS, University of Tokyo // Visiting Scholar at Department of CIS, University of Pennsylvania ---------------------------------------------------------------------- let strstr pat str = let patlen = String.length pat in let strlen = String.length str in let rec is_prefix pos spos epos = if pos < 0 then true else if String.unsafe_get pat pos <> String.unsafe_get str epos then false else is_prefix (pos - 1) spos (epos - 1) in let rec search spos = let epos = spos + patlen - 1 in if epos >= strlen then raise Not_found else if is_prefix (patlen - 1) spos epos then spos else search (spos + 1) in search 0 ----------------------------------------------------------------------