From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <2df4e3af3782344adbb24aae570efef9@vitanuova.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Scaleable mail repositories.
Date: Tue,  8 Nov 2005 19:56:27 +0000
From: rog@vitanuova.com
In-Reply-To: <8ccc8ba40511011429t47bf84a0y293ee9e578d311f8@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: a91e7f62-ead0-11e9-9d60-3106f5b1d025

> Why search just mail? If you store your mail as files and put in place
> a search engine, the views and searches you want to make will work
> for it all.

that would be nice, but i think it's a bit ambitious for what i'm
looking at currently.  the search engine would have to be quite
intelligent:

1) it would have to be triggered on the arrival of new mail (otherwise
newly arrived messages would not be held in the index)
2) it would have to know which parts of the file system contained
mail messages and MIME parse them (assuming the mail files
were stored in raw format, which seems necessary for digital
signature verification, not to mention efficiency of delivery
and storage).
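
as a rough illustration of point 2, here is a minimal sketch of
pulling indexable text out of a raw stored message.  it assumes
python's standard email package; the function name indexable_text is
made up for the example, and only the subject and text/plain parts
are looked at.

```python
from email import message_from_bytes
from email.policy import default


def indexable_text(raw: bytes) -> str:
    """Return the subject plus every text/plain part of a raw message.

    Hypothetical helper: the stored file stays untouched (raw), and
    MIME parsing happens only when the indexer reads it.
    """
    msg = message_from_bytes(raw, policy=default)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        # walk() visits the message itself and every nested part;
        # skip multipart containers and anything that is not plain text
        if part.is_multipart():
            continue
        if part.get_content_type() == "text/plain":
            chunks.append(part.get_content())
    return "\n".join(chunks)
```

attachments, text/html parts and so on would each need their own
handling; the sketch just ignores them.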

having just had a brief glance at the description of Google Desktop,
it appears to do all these things.  in fact, given the special
parsing necessary to index different kinds of data, it's probably
irrelevant what format the mailbox is in - any format can be dealt
with.

i have to say that some kind of "google desktop for plan 9" would be
lovely, but going for mail first is perhaps a more immediately
realisable target.

the first step, anyway, in both cases, is writing the code to do the
inverted index.
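
to make "inverted index" concrete, here is a small in-memory sketch:
a map from each term to the set of messages containing it.  the class
name, the tokenising regexp and the use of plain integers as message
ids are all assumptions for the example; a real one would live on
disk, which is where the external search structure comes in.

```python
import re
from collections import defaultdict

TERM = re.compile(r"[a-z0-9]+")


class InvertedIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # record doc_id under every term that occurs in the text
        for term in TERM.findall(text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # return the ids of documents containing all of the query terms
        terms = TERM.findall(query.lower())
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result
```

indexing a newly arrived message is then one add() call, which is why
the indexer wants to be triggered on delivery.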

i thought i'd write an external search algorithm - i'm most of the way
through an extendible hash implementation.  it seems simple and quick
for insertion, but things get more complex when dealing with large
values, and on deletion, and i'm not sure of the best way to deal
with block allocation.  more seriously, maybe it's essential to have
an algorithm that can do range (e.g.  prefix) lookups, which hashing
doesn't give you.  any elegant (read *small*!), nicely implemented,
open source libraries out there that might fit the bill?  a good
description of an appropriate algorithm would do just as well...
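
for reference, the insertion side of extendible hashing can be
sketched in memory as below.  this is not the implementation
mentioned above, just an illustration: the class names and the
bucket_size parameter are invented, buckets are python dicts standing
in for disk blocks, and deletion, large values and block allocation
(the hard parts) are left out.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # local depth: hash bits this bucket distinguishes
        self.items = {}      # stands in for one fixed-size disk block


class ExtendibleHash:
    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.directory = [Bucket(0)]   # 2**global_depth slots

    def _slot(self, key):
        # index the directory by the low global_depth bits of the hash
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._slot(key)].items.get(key, default)

    def put(self, key, value):
        # note: loops forever if more than bucket_size keys share a full
        # hash value; a real implementation needs overflow handling
        while True:
            b = self.directory[self._slot(key)]
            if key in b.items or len(b.items) < self.bucket_size:
                b.items[key] = value
                return
            self._split(b)

    def _split(self, b):
        if b.depth == self.global_depth:
            # double the directory; slot i and slot i + old size share a bucket
            self.directory = self.directory + self.directory
            self.global_depth += 1
        b.depth += 1
        sibling = Bucket(b.depth)
        new_bit = 1 << (b.depth - 1)
        # slots that pointed at b and have the new bit set now point at sibling
        for i, slot in enumerate(self.directory):
            if slot is b and i & new_bit:
                self.directory[i] = sibling
        # redistribute the old contents between b and its sibling
        old, b.items = b.items, {}
        for k, v in old.items():
            self.directory[self._slot(k)].items[k] = v
```

a lookup costs one directory probe plus one bucket read, but since
keys are scattered by hash value there is no ordered walk, so a
prefix query would have to visit every bucket.  an ordered structure
such as a b-tree is the usual answer when range lookups matter.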