Re: [Caml-list] mboxlib reloaded ;-)


From: Oliver Bandel <oliver@first.in-berlin.de>
To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] mboxlib reloaded ;-)
Date: Sat, 28 Apr 2007 16:18:12 +0200
Message-ID: <20070428141812.GD2396@first.in-berlin.de>
In-Reply-To: <1177768191.11923.24.camel@rosella.wigram>

On Sat, Apr 28, 2007 at 11:49:51PM +1000, skaller wrote:
> On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote:
> > On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote:
> > > On Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel
> > > <oliver@first.in-berlin.de> wrote:
> > > > > You should check the size (number of states) of the generated
> > > > > lexer.
> > > > 
> > > > How?
> > > 
> > > It's printed out by ocamllex when you run it on your .mll file.
> > > Regards,
> > 
> > Ah, ok. :)
> > 
> > 
> > 18 states, 261 transitions, table size 1152 bytes.
> > 
> > Doesn't look very huge ;-)
> 
> Lol, no it is tiny. You are probably right, too many calls,
> and too much copying data around. AFAIK Ocaml channels also
> add an extra buffer layer (is that right?) so there's even
> more copying.
> 
> Still, although Ocaml may generate more code than C,
> if your code is reasonably tight it should be cached
> and be fast: function calls are actually quite cheap.
> 
> Here's an idea: you said:
> 
> "For the about 100MB mbox there are 2.5 * 10^6 calls
> to Buffer.add_string for the header and 1.6 * 10^6 calls
> to Buffer.add_string for the body, 2.6*10^6 calls to the
> function lexing.engine, ..."
> 
> How about NOT storing the body text. Instead, just store
> the integer file offset of the first byte and the length?
[...]

This is what I have planned for other kinds of
applications. In the parts I have implemented so far,
I read the data completely, because that is the most
general way of handling it. It lets me do full-text
search and possibly very complex analysis of the mail
text, and this is the reason why I read everything
into memory.

Using file offsets would make sense if the main
functionality were that of a mail reader, where
you only look into one mail or another now and then.
Then reading from disk is not a problem.

But when you want to be able to do many different things,
reading from disk again and again might be much slower
than reading the data into memory once and then working
on it.

But maybe I will provide such functionality as well,
and read into memory only when I need to.
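
A minimal sketch of how the offset idea could look, using only
the standard library. The names body_ref, load_body and lazy_body
are hypothetical and not part of the library discussed here; the
sketch assumes the scanner has already recorded where each body
starts and how long it is.

```ocaml
(* Hypothetical record: where a mail body lives inside the mbox file. *)
type body_ref = {
  offset : int;  (* byte position of the first body byte *)
  length : int;  (* number of body bytes *)
}

(* Read one body from disk. The channel is closed on success and on error. *)
let load_body (filename : string) (r : body_ref) : string =
  let ic = open_in_bin filename in
  match seek_in ic r.offset; really_input_string ic r.length with
  | s -> close_in ic; s
  | exception e -> close_in_noerr ic; raise e

(* Defer the read until the body is forced; the result is then cached. *)
let lazy_body (filename : string) (r : body_ref) : string Lazy.t =
  lazy (load_body filename r)
```

With this, a mail record could hold a `string Lazy.t` for the body:
headers are scanned once, and a body is only pulled into memory
the first time some analysis calls Lazy.force on it.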

The main intent of the lib (yes, a library, not a specific
application) was to have a tool at hand that makes
complex things easy to achieve.

Example:

=========================================================
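open Mbox

let filename = "./MB1"

let () =
  print_endline "OK, let's start!";
  (* Read the whole mbox file into memory. *)
  let sm = read filename in
  (* Select mails whose header matches the regexp (NoCase flag). *)
  let mymatch = match_header_regexp ".*richard" NoCase in
  (* Keep the matching mails under the new file name. *)
  let result = filter_rename mymatch sm "./MB1.ready" in
  write_force result;
  print_endline "OK, ready! :)"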

=========================================================

This opens the mbox file "MB1" and writes all mails
with the string "richard" in the header to the
mbox file "MB1.ready".

It could also be used for spam detection, word counts,
text analysis, automatic rearranging of mbox files,
discarding duplicate mails, or saving the same mail in
multiple folders if it belongs to more than one topic.
Or for complex data analysis on the mails, e.g.
statistical text analysis or similarity calculations
between texts, ...

ocamlc:   31.4 seconds
ocamlopt:  7.7 seconds

Doing an exit after the reading stage:
ocamlc:   7.0 seconds
ocamlopt: 2.5 seconds

So the pattern matching (using the Str module)
takes much more time than the reading/scanning.
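
For a fixed word like "richard", one thing that could be tried is
a plain case-insensitive substring scan that bypasses the regexp
engine entirely. This is only a sketch: contains_nocase is a
hypothetical helper, not part of Str or of this library, and I
have not measured whether it is actually faster on this mbox.

```ocaml
(* Case-insensitive search for a fixed word, without a regexp engine.
   Naive scan: tries every start position in the haystack. *)
let contains_nocase ~(needle : string) (haystack : string) : bool =
  let needle = String.lowercase_ascii needle in
  let n = String.length needle and h = String.length haystack in
  (* Does needle match haystack at position i, from needle index j on? *)
  let rec matches_at i j =
    j = n
    || (Char.lowercase_ascii haystack.[i + j] = needle.[j]
        && matches_at i (j + 1))
  in
  (* Try start positions left to right while the needle still fits. *)
  let rec scan i = i + n <= h && (matches_at i 0 || scan (i + 1)) in
  scan 0
```

It would be called as `contains_nocase ~needle:"richard" header`
for each header string.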

It's an mbox of about 100 MB with mostly short mails.
(I could provide more statistics, using this lib :))

Doing a
  $ time cat MB1 > MB2
takes 0.2 seconds, and

  $ time grep -i richard MB1 > MB2
takes 0.148 seconds.

OK, such flat-file tools are always faster,
as they do not hold all the data in memory and they do not
check the validity of the files (no exception on
broken mbox files).

But maybe it could be done better.

Parsing stage:
  The Str module does not seem to be the fastest, and ocamllex
  creates .ml files that first have to be compiled...

  Is there no library in OCaml that can be used for this?
  Or does it have to be developed?

  I am thinking of something like this:
    http://swtch.com/~rsc/regexp/regexp1.html

  Something like a runtime ocamllex, which creates data structures
  instead of OCaml code, would make sense, IMHO.

  Is no such thing available already?
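
To make "data structures instead of OCaml code" concrete, here is
a small sketch of a regexp that is built and matched at runtime.
Note that it uses Brzozowski derivatives, not the Thompson NFA
construction from the article linked above, and that every name
in it (re, deriv, matches, literal) is made up for illustration.
It matches whole strings only and does no tokenizing.

```ocaml
(* A regexp as a plain runtime value. *)
type re =
  | Empty            (* matches nothing *)
  | Eps              (* matches the empty string *)
  | Chr of char      (* one specific character *)
  | Any              (* any single character *)
  | Seq of re * re
  | Alt of re * re
  | Star of re

(* Can the regexp match the empty string? *)
let rec nullable = function
  | Empty | Chr _ | Any -> false
  | Eps | Star _ -> true
  | Seq (a, b) -> nullable a && nullable b
  | Alt (a, b) -> nullable a || nullable b

(* Smart constructors that simplify as they build. *)
let seq a b = match a, b with
  | Empty, _ | _, Empty -> Empty
  | Eps, r | r, Eps -> r
  | _ -> Seq (a, b)

let alt a b = match a, b with
  | Empty, r | r, Empty -> r
  | _ -> if a = b then a else Alt (a, b)

(* deriv c r matches exactly the strings s such that r matches c ^ s. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Chr d -> if c = d then Eps else Empty
  | Any -> Eps
  | Seq (a, b) ->
    let left = seq (deriv c a) b in
    if nullable a then alt left (deriv c b) else left
  | Alt (a, b) -> alt (deriv c a) (deriv c b)
  | Star a as r -> seq (deriv c a) r

(* Whole-string match: take one derivative per input character. *)
let matches (r : re) (s : string) : bool =
  let cur = ref r in
  String.iter (fun c -> cur := deriv c !cur) s;
  nullable !cur

(* Build the regexp for a literal string. *)
let literal (s : string) : re =
  let r = ref Eps in
  for i = String.length s - 1 downto 0 do
    r := seq (Chr s.[i]) !r
  done;
  !r
```

For example, `seq (Star Any) (seq (literal "richard") (Star Any))`
is a value that can be constructed from user input at runtime and
passed to matches. A serious version would have to memoize the
derivatives into a transition table, since the terms can grow
during matching; this sketch does not do that.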

Ciao,
   Oliver
