From: Oliver Bandel
To: caml-list@inria.fr
Date: Sat, 28 Apr 2007 12:47:47 +0200
Subject: Re: [Caml-list] mboxlib reloaded ;-)
Message-ID: <20070428104746.GA363@first.in-berlin.de>
In-Reply-To: <1177721646.16582.8.camel@rosella.wigram>

On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
>
> > So, I then checked my mboxlib and saw that it is quite slow,
> > compared to what I expected (expected! I did not try it
> > on my development machine because I have no mutt installed there),
> > and even if native code is much faster, it is nevertheless slow...
> > ...so I thought I have to redesign my scanner stage.
> > (I use the Str module and ocamllex mixed together; maybe
> > using a plain self-written OCaml scanner might be better here).
>
> Ocamllex generates a very fast scanner: it is using
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc. when processing
> text buffers). I doubt you can hand code anything as
> fast as Ocamllex in C, let alone in OCaml.

I know that ocamllex is fast. But I call the ocamllex-generated
lexer many, many times from my own functions, and maybe this could
be done more elegantly, with fewer calls to the lexer. Or maybe I
should not lex directly from the channel, and should instead read a
bigger chunk of data into memory and then lex on that.

Or maybe I should first scan the whole header and then the body of
each mail, and only afterwards split the header into separate lines,
when it is already in RAM.

>
> You should check the size (number of states) of the generated
> lexer.

How?

> It will run faster with small number of states where
> the matrix fits easily in the cache.

I think that there are not so many states, but so many calls.

And maybe creating a list of header entries is faster than
creating strings with the Buffer module, because I always call
Buffer.add_string and so on and so on, instead of putting
the line onto a list.
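
The two ideas above (read the data into memory first, then split the
header of a mail into a list of lines once it is in RAM) could look
roughly like the sketch below. It is only a minimal illustration, not
the mboxlib code: read_file and split_message are hypothetical names,
it assumes a recent OCaml (4.08 or later, for Fun.protect), and it
does not join folded header lines.

```ocaml
(* Hypothetical helpers for illustration; not part of mboxlib. *)

(* Read the whole file into memory with one call, instead of lexing
   directly from the channel. *)
let read_file path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Split one mail that is already in memory into a list of header
   lines and the body. The header ends at the first empty line.
   Header lines are consed onto a list instead of being appended
   to a Buffer. *)
let split_message msg =
  let rec go acc = function
    | [] -> (List.rev acc, [])            (* no empty line: no body *)
    | "" :: body -> (List.rev acc, body)  (* empty line ends header *)
    | line :: rest -> go (line :: acc) rest
  in
  let headers, body = go [] (String.split_on_char '\n' msg) in
  (headers, String.concat "\n" body)
```

Whether the list really beats Buffer.add_string here would still have
to be measured on the real mbox file.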

For the mbox of about 100 MB there are 2.5 * 10^6 calls to
Buffer.add_string for the header and 1.6 * 10^6 calls to
Buffer.add_string for the body, 2.6 * 10^6 calls to the
function Lexing.engine, ...

It seems I had better not read line by line.

And there may be other reasons why it is slow.

I let the lexer read line by line and count the line number.
That is because I throw an exception when I detect a broken
mbox file (when an mbox file ends in the middle of a header).

So maybe I do too much, and too often.

I think there are too many calls, not too many states in the lexer.
(But you could argue that both things are closely related.)

Ciao,
   Oliver