From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2df4e3af3782344adbb24aae570efef9@vitanuova.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Scaleable mail repositories.
Date: Tue, 8 Nov 2005 19:56:27 +0000
From: rog@vitanuova.com
In-Reply-To: <8ccc8ba40511011429t47bf84a0y293ee9e578d311f8@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: a91e7f62-ead0-11e9-9d60-3106f5b1d025

> Why search just mail? If you store your mail as files and put in place
> a search engine, the views and searches you want to make will work
> for it all.

that would be nice, but i think it's a bit ambitious for what i'm
looking at currently. the search engine would have to be quite
intelligent:

1) it would have to be triggered on the arrival of new mail
(otherwise newly arrived messages would not be held in the index)

2) it would have to know which parts of the file system contained mail
messages and MIME parse them (assuming the mail files were stored in
raw format, which seems necessary for digital signature verification,
not to mention efficiency of delivery and storage).

having just had a brief glance at the description of Google Desktop,
it appears that it probably does all these things. in fact, given the
special parsing necessary to index different kinds of data, it's
probably irrelevant what format the mailbox is in - it's dealable
with.

i have to say that some kind of "google desktop for plan 9" would be
lovely, but going for mail first is perhaps a more immediately
realisable target.

the first step, anyway, in both cases, is writing the code to do the
inverted index.

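as a rough illustration of that first step, here is a minimal
in-memory inverted index sketched in Python. it is not code from this
thread: the names `InvertedIndex`, `message_text` and `WORD` are made
up for the example, and it assumes messages are kept in raw RFC 822
form and MIME-parsed only at index time, as described above. a real
index would live on disk and would need a better tokeniser.

```python
import email
import re
from collections import defaultdict

# hypothetical tokeniser: runs of lowercase letters and digits
WORD = re.compile(r"[a-z0-9]+")


def message_text(raw: bytes) -> str:
    """Return the subject plus every text/plain part of a raw message."""
    msg = email.message_from_bytes(raw)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name unknown to Python
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)


class InvertedIndex:
    def __init__(self):
        # term -> set of ids of the messages containing that term
        self.postings = defaultdict(set)

    def add(self, msg_id, raw: bytes) -> None:
        """Index one newly arrived message (call this at delivery time)."""
        for term in WORD.findall(message_text(raw).lower()):
            self.postings[term].add(msg_id)

    def search(self, query: str) -> set:
        """Return ids of messages containing every term in the query."""
        terms = WORD.findall(query.lower())
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))
```

calling `add` from whatever delivers the mail would cover point 1
above; the MIME walk in `message_text` is a stand-in for point 2.
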
i thought i'd write an external search algorithm. i'm most of the way
through an extendible hashing implementation, which seems simple and
quick for insertion, but things get more complex when dealing with
large values and on deletion, and i'm not sure of the best way to deal
with block allocation. more seriously, maybe it's essential to have an
algorithm that can do range (e.g. prefix) lookups, which a hash can't
provide because it throws away key order.

any elegant (read *small*!), nicely implemented, open source libraries
out there that might fit the bill? a good description of an
appropriate algorithm would do just as well...
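
for anyone unfamiliar with the scheme, the insertion path of
extendible hashing can be sketched roughly as follows. this is an
in-memory toy in Python, not the implementation mentioned above: the
names `Bucket` and `ExtendibleHash` and the `bucket_size` parameter
are invented for the example, each bucket stands in for one disk
block, and large values, deletion and block allocation are all left
out.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth: low hash bits shared by all keys here
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size  # stand-in for "entries per disk block"
        self.global_depth = 0
        self.dir = [Bucket(0)]          # directory of 2**global_depth slots

    def _slot(self, key):
        # directory index = low global_depth bits of the hash
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.dir[self._slot(key)].items.get(key, default)

    def put(self, key, value):
        # note: loops forever if more than bucket_size distinct keys
        # share an identical full hash; a real version needs overflow blocks
        while True:
            b = self.dir[self._slot(key)]
            if key in b.items or len(b.items) < self.bucket_size:
                b.items[key] = value
                return
            self._split(b)

    def _split(self, b):
        if b.depth == self.global_depth:
            # double the directory; slots i and i + old_size share a bucket
            self.dir = self.dir + self.dir
            self.global_depth += 1
        b.depth += 1
        new = Bucket(b.depth)
        bit = 1 << (b.depth - 1)  # the hash bit that now tells b and new apart
        for i, x in enumerate(self.dir):
            if x is b and i & bit:
                self.dir[i] = new
        # redistribute the old contents between b and new
        old, b.items = b.items, {}
        for k, v in old.items():
            self.dir[self._slot(k)].items[k] = v
```

a lookup touches the directory and exactly one bucket, which is the
attraction for an on-disk index. it does nothing for prefix queries,
though: those need a structure that keeps keys in sorted order, such
as a B-tree.
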