caml-list - the Caml user's mailing list
From: Oliver Bandel <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: Re: [Caml-list] mboxlib reloaded ;-)
Date: Sat, 28 Apr 2007 12:47:47 +0200
Message-ID: <20070428104746.GA363@first.in-berlin.de>
In-Reply-To: <1177721646.16582.8.camel@rosella.wigram>

On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
> 
> > So, I then checked my mboxlib and saw that it is quite slow
> > compared to what I expected (expect! I did not try it
> > on my development machine because I have no mutt installed there),
> > and even if native code is much faster, it's nevertheless slow...
> > ...so I thought I have to redesign my scanner stage.
> > (I use the Str module and ocamllex mixed together; maybe
> >  using a plain self-written OCaml scanner might be better here.)
> 
> Ocamllex generates a very fast scanner: it uses
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc. when processing
> text buffers). I doubt you can hand-code anything as
> fast as Ocamllex in C, let alone in OCaml.

I know that ocamllex is fast.

But I call ocamllex many, many times from my
own functions, and this could maybe be done
more elegantly / with fewer calls to ocamllex.
Or maybe I should not lex directly from the channel
but rather read a bigger chunk of data
into memory and then lex on that.
Or maybe I should first scan the whole header and
then the body of each mail, and only afterwards
scan the header again into separate lines,
when it is already in RAM.
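
A minimal sketch of the "read a bigger chunk into memory and then lex on
that" idea, assuming the file fits into a single OCaml string. The names
read_file and lexbuf_of_file are made up for illustration and are not part
of mboxlib; Lexing.from_string and Lexing.from_channel are the standard
library entry points.

```ocaml
(* Read a whole file into one string, so that the lexer can run over
   memory instead of refilling its buffer from the channel.
   Hypothetical helper, not part of mboxlib. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Build a lexbuf over the in-memory contents. Lexing.from_channel would
   instead refill a small internal buffer again and again while lexing. *)
let lexbuf_of_file (path : string) : Lexing.lexbuf =
  Lexing.from_string (read_file path)
```

The resulting lexbuf can be passed to any ocamllex-generated rule in place
of one made with Lexing.from_channel. Whether this is actually faster has to
be measured. On 32-bit systems a string is limited to about 16 MB, so a
100MB mbox would have to be processed in several chunks there.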


> 
> You should check the size (number of states) of the generated
> lexer.

How?

> It will run faster with a small number of states where
> the matrix fits easily in the cache.

I think that there are not so many states, but so many calls.

And maybe creating a list of header entries is faster than
creating strings with the Buffer module, because I always call
Buffer.add_string and so on and so on, instead of putting
the line onto a list.
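
The two approaches compared above could look roughly like this. It is only
a sketch: join_with_buffer, collect_lines and the next_line callback are
hypothetical names standing in for whatever the scanner delivers, not
mboxlib's real code.

```ocaml
(* Variant A: copy every line into a Buffer, which costs one
   Buffer.add_string call (plus one add_char) per line. *)
let join_with_buffer (lines : string list) : string =
  let b = Buffer.create 1024 in
  List.iter
    (fun l ->
      Buffer.add_string b l;
      Buffer.add_char b '\n')
    lines;
  Buffer.contents b

(* Variant B: cons each line onto a list while scanning and reverse once
   at the end. The lines stay separate strings and are not copied again. *)
let collect_lines (next_line : unit -> string option) : string list =
  let rec loop acc =
    match next_line () with
    | Some l -> loop (l :: acc)
    | None -> List.rev acc
  in
  loop []
```

Variant B avoids the copying, but it allocates one list cell per line, so
only a profile of both can tell which one wins on a real mbox.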

For the mbox of about 100MB there are 2.5 * 10^6 calls
to Buffer.add_string for the header, 1.6 * 10^6 calls
to Buffer.add_string for the body, 2.6 * 10^6 calls to the
function Lexing.engine, ...

I had better not read linewise, it seems.


And there may be other reasons why it is slow.
I let the lexer read linewise and count the line numbers.
That is because I throw an exception when I detect a
broken mbox file (when an mbox file ends in the middle
of a header).
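
The line counting and the exception can be sketched as follows. This uses
plain input_line instead of the ocamllex rules, purely to keep the example
self-contained; Broken_mbox and read_header are names invented for the
sketch.

```ocaml
(* Raised when the file ends inside a header; carries the number of
   lines read so far. Hypothetical exception, for illustration only. *)
exception Broken_mbox of int

(* Read header lines up to the empty line that separates header and body,
   counting every line that is consumed in [lineno]. *)
let read_header (ic : in_channel) (lineno : int ref) : string list =
  let rec loop acc =
    match input_line ic with
    | exception End_of_file -> raise (Broken_mbox !lineno)
    | "" -> incr lineno; List.rev acc
    | l -> incr lineno; loop (l :: acc)
  in
  loop []
```

The counter is only needed for the error report, so the bookkeeping is one
increment per line; the cost of reading linewise lies in the per-line calls
themselves.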

So maybe I do too much and too often.
I think there are too many calls, not too many
states in the lexer.

(But you could argue that both things are closely related).


Ciao,
   Oliver

