Re: [9fans] Scaleable mail repositories.

9fans - fans of the OS Plan 9 from Bell Labs

* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:33 Fco. J. Ballesteros
  2005-10-31 18:38 ` William Josephson
  0 siblings, 1 reply; 27+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:33 UTC (permalink / raw)
  To: 9fans

:  5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
:  be worth it since you get all the tools. It's certainly faster than mh 
:  ever was.

Yep. I agree. I think that the problem is that you need something else
(apart from grep) to search on big file trees. Any search tool that lets
you look up file paths by content would work with your mail as well.
We have a home-grown program that does that,
but it does not work well enough and
does not deserve distribution.

I have placed a copy of our local /sys/src/cmd/mail2fs at
/n/sources/contrib/nemo/mail2fs, in case anyone wants to experiment before
we clean it up.
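
A minimal sketch of the kind of tool described above, one that looks up file paths by content. It is not the home-grown program mentioned in the message; the names build_index and lookup are hypothetical, and the one-directory-per-message layout with a file called text is assumed from the grep output elsewhere in the thread. A query matches files containing all of its words, not the exact phrase.

```python
import os
import re
import tempfile
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(root):
    """Map each lowercase word to the set of file paths that contain it."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                for word in WORD.findall(f.read().lower()):
                    index[word].add(path)
    return index

def lookup(index, query):
    """Return the paths that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result

# Tiny demonstration tree (assumed layout: <number>/text).
root = tempfile.mkdtemp()
for num, body in [("3376", "no need to argue.\n"),
                  ("3377", "> no need to argue.\n"),
                  ("3101", "implement it and let's see\n")]:
    os.makedirs(os.path.join(root, num))
    with open(os.path.join(root, num, "text"), "w") as f:
        f.write(body)

index = build_index(root)
hits = sorted(os.path.relpath(p, root) for p in lookup(index, "need argue"))
misses = lookup(index, "glimpse")
```

Once the index is built, a lookup touches only the index, so its cost does not grow with the number of files the way a grep over */text does.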

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 15:33 [9fans] Scaleable mail repositories Fco. J. Ballesteros
@ 2005-10-31 18:38 ` William Josephson
  0 siblings, 0 replies; 27+ messages in thread
From: William Josephson @ 2005-10-31 18:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Oct 31, 2005 at 04:33:29PM +0100, Fco. J. Ballesteros wrote:
> :  5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
> :  be worth it since you get all the tools. It's certainly faster than mh 
> :  ever was.
> 
> Yep. I agree. I think that the problem is that you need something else
> (apart from grep) to search on big file trees. Any search tool that lets
> you look up file paths by content would work with your mail as well.
> We have a home-grown program that does that, but it does not work well
> enough and does not deserve distribution.

I know at least one person who uses glimpse for this.
Of course his glimpse index process tended to knock
over one of the Sparc cycle servers with some regularity.



* Re: [9fans] Scaleable mail repositories.
@ 2005-11-09  9:45 Fco. J. Ballesteros
  2005-11-09 10:24 ` Charles Forsyth
  0 siblings, 1 reply; 27+ messages in thread
From: Fco. J. Ballesteros @ 2005-11-09  9:45 UTC (permalink / raw)
  To: 9fans

:  that would be nice, but i think it's a bit ambitious for what i'm
:  looking at currently.  the search engine would have to be quite
:  intelligent:
:  
:  1) it would have to be triggered on the arrival of new mail (otherwise
:  newly arrived messages would not be held in the index)

It would have to be triggered on the changing of files in the
file system. With some help from the fs, this becomes cheap.

:  2) it would have to know which parts of the file system contained
:  mail messages and MIME parse them (assuming the mail files
:  were stored in raw format, which seems necessary for digital
:  signature verification, not to mention efficiency of delivery
:  and storage).

Don't agree. I store the messages in cooked format. That makes it easy to
understand mime :-)
If you want the raw message for whatever purposes,
you might also keep that thing. 
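
A rough sketch of triggering on file changes when the file system gives no help at all: walk the tree and compare modification times against the previous scan. This polling fallback is an assumption for illustration, not the mechanism the message has in mind, and changed_files is a hypothetical name. With help from the fs, the walk would be replaced by a list of changed files.

```python
import os
import tempfile

def changed_files(root, seen):
    """Return paths under root whose mtime differs from the one recorded in
    seen, updating seen as a side effect. seen maps path -> mtime in ns."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return sorted(changed)

# Demonstration: one message file, scanned three times.
root = tempfile.mkdtemp()
msg = os.path.join(root, "text")
with open(msg, "w") as f:
    f.write("hello\n")

seen = {}
first = changed_files(root, seen)    # new file, so it is reported
second = changed_files(root, seen)   # nothing changed since the first scan
st = os.stat(msg)
os.utime(msg, ns=(st.st_atime_ns, st.st_mtime_ns + 10**9))  # simulate a later write
third = changed_files(root, seen)    # the modified file is reported again
```

Only the files returned by each scan would need to be handed to the indexer.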


* Re: [9fans] Scaleable mail repositories.
  2005-11-09  9:45 Fco. J. Ballesteros
@ 2005-11-09 10:24 ` Charles Forsyth
  2005-11-09 14:19   ` Sam
  0 siblings, 1 reply; 27+ messages in thread
From: Charles Forsyth @ 2005-11-09 10:24 UTC (permalink / raw)
  To: 9fans

> It would have to be triggered on the changing of files in the
> file system. With some help from the fs, this becomes cheap.

i myself don't want the mail system to have to rely on a particular
underlying file system.   you could certainly push an indexing fs in front
of any other, i suppose, but relying on there being a
definite article (`the' file system), as that and similar remarks
imply, doesn't seem right to me in a distributed system where
different filing resources can be bound in on demand.


* Re: [9fans] Scaleable mail repositories.
  2005-11-09 10:24 ` Charles Forsyth
@ 2005-11-09 14:19   ` Sam
  2005-11-10  1:24     ` erik quanstrom
  0 siblings, 1 reply; 27+ messages in thread
From: Sam @ 2005-11-09 14:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

In the not-so-distant past I was part of a three man
effort to write a web site indexer / search engine
generator.  My job was to take the indexed files / urls
(they sucked them down with java) and create a suffix
tree database that could be searched upon via cgi.  I
don't have any specific numbers, but it was quite fast.

This was when google was just becoming known and once
we realized we could point google at a website the
project was abandoned.

The whole point of using suffix trees is linear time
search wrt the size of the search string (note: not
the size of the searched text).  Seems like it's
a good candidate for this task.

Sam
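
A small sketch of searching whose cost depends mainly on the length of the search string. Note the swap: this uses a suffix array with binary search rather than a suffix tree, so a lookup costs O(m log n) character comparisons for a pattern of length m in a text of length n, not the O(m) of a real suffix tree. The construction here is deliberately naive, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in sorted order.
    Naive construction (sorts whole suffixes); for demonstration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """Return the sorted offsets at which pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
ana = find_all(text, sa, "ana")
none = find_all(text, sa, "x")
```

The binary search works because truncating sorted suffixes to their first m characters keeps them in non-decreasing order, so all matches form one contiguous run in the array.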


* Re: [9fans] Scaleable mail repositories.
  2005-11-09 14:19   ` Sam
@ 2005-11-10  1:24     ` erik quanstrom
  2005-11-10  2:30       ` Russ Cox
  0 siblings, 1 reply; 27+ messages in thread
From: erik quanstrom @ 2005-11-10  1:24 UTC (permalink / raw)
  To: 9fans, Sam

suffix arrays create an index that is bigger than the 
original data. regardless of the theoretical O(1) mumble,
the size of the index is a major drawback.

erik

Sam <sah@softcardsystems.com> writes

| 
| In the not-so-distant past I was part of a three man
| effort to write a web site indexer / search engine
| generator.  My job was to take the indexed files / urls
| (they sucked them down with java) and create a suffix
| tree database that could be searched upon via cgi.  I
| don't have any specific numbers, but it was quite fast.
| 
| This was when google was just becoming known and once
| we realized we could point google at a website the
| project was abandoned.
| 
| The whole point of using suffix trees is linear time
| search wrt the size of the search string (note: not
| the size of the searched text).  Seems like it's
| a good candidate for this task.
| 
| Sam


* Re: [9fans] Scaleable mail repositories.
  2005-11-10  1:24     ` erik quanstrom
@ 2005-11-10  2:30       ` Russ Cox
  2005-11-10  6:33         ` Scott Schwartz
  2005-11-10 11:55         ` erik quanstrom
  0 siblings, 2 replies; 27+ messages in thread
From: Russ Cox @ 2005-11-10  2:30 UTC (permalink / raw)
  To: erik quanstrom, Fans of the OS Plan 9 from Bell Labs

> suffix arrays create an index that is bigger than the
> original data. regardless of the theoretical O(1) mumble,
> the size of the index is a major drawback.

That's true, but it depends a lot on the app.
The computational biology guys seem to love them
for indexing large amounts of DNA.

Russ



* Re: [9fans] Scaleable mail repositories.
  2005-11-10  2:30       ` Russ Cox
@ 2005-11-10  6:33         ` Scott Schwartz
  2005-11-10 11:55         ` erik quanstrom
  1 sibling, 0 replies; 27+ messages in thread
From: Scott Schwartz @ 2005-11-10  6:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

| That's true, but it depends a lot on the app.
| The computational biology guys seem to love them
| for indexing large amounts of DNA.

Yes, but even there it's fair to say that opinion is mixed.  A lot of
really good bioinformatics code (e.g. blastz, megablast, blat) uses hash
table based methods instead.


* Re: [9fans] Scaleable mail repositories.
  2005-11-10  2:30       ` Russ Cox
  2005-11-10  6:33         ` Scott Schwartz
@ 2005-11-10 11:55         ` erik quanstrom
  1 sibling, 0 replies; 27+ messages in thread
From: erik quanstrom @ 2005-11-10 11:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs, Russ Cox

yes, and they've developed some interesting high-performance algorithms, which i've scanned,
but need to take a good look at.

the computational bio guys love it because they have long strings of base pairs that they want
to index. and suffix arrays are the ticket for that. the reason they love suffix arrays is that
there is no natural "word".

text searching would be the opposite. words are the natural unit (in speech there are no letters)
and words are often repeated.

- erik
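
To put rough numbers on the argument above: an uncompressed suffix array needs one entry per character of the text (with 4-byte offsets, an assumption, that is four times the size of a text stored one byte per character), while a word-level index needs one dictionary entry per distinct word plus one posting per occurrence. The sketch below only counts those quantities for a sample string; index_sizes is a hypothetical helper, not a real indexer.

```python
import re

def index_sizes(text):
    """Return (suffix_entries, word_occurrences, distinct_words) for text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return len(text), len(words), len(set(words))

sample = "no need to argue. no need to argue, really no need."
suffix_entries, word_occurrences, distinct_words = index_sizes(sample)

# Assumed 4-byte offsets for the suffix array.
suffix_array_bytes = 4 * suffix_entries
```

Because words repeat, the number of distinct words grows far more slowly than the number of characters, which is the property a word-based index exploits and a DNA string does not have.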

Russ Cox <rsc@swtch.com> writes

| 
| > suffix arrays create an index that is bigger than the
| > original data. regardless of the theoretical O(1) mumble,
| > the size of the index is a major drawback.
| 
| That's true, but it depends a lot on the app.
| The computational biology guys seem to love them
| for indexing large amounts of DNA.
| 
| Russ


* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:19 Fco. J. Ballesteros
  0 siblings, 0 replies; 27+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:19 UTC (permalink / raw)
  To: 9fans

:  And remember, whatever new scheme wins, it still has to be
:  able to read my 20+ years of mbox format messages too.

It can ;-), it's just a matter of converting them...

for (m in yourmboxes){
	mail2fs -d $home/mail/$m $m
}

I mean, I still use omail (~ acme's Mail) to read mboxes I did not
convert, and convert some of them to the directory hierarchy format as I use them.
But IMHO, there is no problem with that.
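
For readers without mail2fs at hand, a sketch of the same kind of conversion using Python's standard mailbox module. It is not mail2fs and does not reproduce its output format: the one-directory-per-message layout with a file named text is assumed from the grep example elsewhere in the thread, mbox_to_dirs is a hypothetical name, and only single-part text messages are handled.

```python
import mailbox
import os
import tempfile

def mbox_to_dirs(mbox_path, dest):
    """Write each message of an mbox into dest/<n>/text; return the count."""
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for n, msg in enumerate(box, start=1):
            msgdir = os.path.join(dest, str(n))
            os.makedirs(msgdir, exist_ok=True)
            with open(os.path.join(msgdir, "text"), "w") as f:
                f.write("From: %s\n" % msg.get("From", ""))
                f.write("Subject: %s\n\n" % msg.get("Subject", ""))
                payload = msg.get_payload()
                # Multipart messages would need a walk over their parts;
                # this sketch writes nothing for them.
                f.write(payload if isinstance(payload, str) else "")
            count = n
    finally:
        box.close()
    return count

# Demonstration with a two-message mbox.
work = tempfile.mkdtemp()
mbox_path = os.path.join(work, "mbox")
with open(mbox_path, "w") as f:
    f.write("From alice@example.com Mon Oct 31 15:00:00 2005\n"
            "From: alice@example.com\nSubject: one\n\nfirst body\n\n"
            "From bob@example.com Mon Oct 31 16:00:00 2005\n"
            "From: bob@example.com\nSubject: two\n\nsecond body\n")

dest = os.path.join(work, "mail")
count = mbox_to_dirs(mbox_path, dest)
with open(os.path.join(dest, "1", "text")) as f:
    first_text = f.read()
with open(os.path.join(dest, "2", "text")) as f:
    second_text = f.read()
```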


* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:14 Fco. J. Ballesteros
  2005-10-31 16:22 ` Ronald G Minnich
  0 siblings, 1 reply; 27+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:14 UTC (permalink / raw)
  To: 9fans

It's implemented. I'm using it :-)
Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):

; cd /usr/nemo/mail/mbox
; ls | wc
   1294    1294    9050	# 1294 mails, one dir per mail
; time grep 'no need to argue' */text 
3376/text:no need to argue.
3377/text:> no need to argue.
0.03u 0.10s 5.07r 	 grep no need to argue 3101/text 3373/text 3376/text ...

:  
:  just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
:  utility (which I went to from mh) and I'll buy it. MH got so painfully 
:  slow for me that I couldn't take it.
:  
:  But, hey, implement it and let's see .


* Re: [9fans] Scaleable mail repositories.
  2005-10-31 15:14 Fco. J. Ballesteros
@ 2005-10-31 16:22 ` Ronald G Minnich
  2005-10-31 18:37   ` William Josephson
  0 siblings, 1 reply; 27+ messages in thread
From: Ronald G Minnich @ 2005-10-31 16:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Fco. J. Ballesteros wrote:
> It's implemented. I'm using it :-)
> Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):

5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
be worth it since you get all the tools. It's certainly faster than mh 
ever was.

ron



* Re: [9fans] Scaleable mail repositories.
  2005-10-31 16:22 ` Ronald G Minnich
@ 2005-10-31 18:37   ` William Josephson
  0 siblings, 0 replies; 27+ messages in thread
From: William Josephson @ 2005-10-31 18:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Oct 31, 2005 at 09:22:00AM -0700, Ronald G Minnich wrote:
> Fco. J. Ballesteros wrote:
> >It's implemented. I'm using it :-)
> >Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):
> 
> 5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
> be worth it since you get all the tools. It's certainly faster than mh 
> ever was.

More like painfully slow.  From this calendar year alone,
I have over 120,000 messages.
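
The arithmetic behind "painfully slow", as a quick sketch. It takes the figures quoted in the thread (5.07 s real time over 1294 message directories) and assumes, without evidence, that grep time scales linearly with the number of messages and that cache conditions stay the same.

```python
elapsed_s = 5.07       # real time of the grep run quoted in the thread
scanned = 1294         # message directories counted by ls | wc in that run
mailbox = 120_000      # messages mentioned in this reply

per_message_ms = elapsed_s / scanned * 1000
projected_s = elapsed_s / scanned * mailbox
projected_min = projected_s / 60
```

Under those assumptions each message costs a little under 4 ms, and one search over 120,000 messages takes roughly eight minutes.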



* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 11:32 Fco. J. Ballesteros
  2005-10-31 16:01 ` Ronald G Minnich
  0 siblings, 1 reply; 27+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 11:32 UTC (permalink / raw)
  To: 9fans

It's easy to write file servers, but that does not mean that it's the
right way to proceed. IMHO, if you want to see your mail as files, and
you have a file server, it's easier to store the mail in that format. All
the code necessary to handle your storage and index structure becomes
fossil/venti, and all that has to be done is to convert from the mbox format
into your preferred archival format, and to feed upas with input messages
for sending. Isn't this more simple and powerful? Or are you thinking of
something else that is best done using the existing format?

:  i don't understand why, now that it's easy to write file servers
:  (compared to unix days), it's necessary to store the mail messages
:  as actual separate files or directories.  the main problem with
:  upas/fs i find is that it rewrites the file instead of treating it
:  as append-only, and it reads the whole thing into memory (in a moderately
:  bulky format); rather than maintaining a separate index file or files,
:  and loading as needed.   both the storage and index structure can
:  then be made suitable for the task.


* Re: [9fans] Scaleable mail repositories.
  2005-10-31 11:32 Fco. J. Ballesteros
@ 2005-10-31 16:01 ` Ronald G Minnich
  2005-10-31 15:06   ` jmk
  0 siblings, 1 reply; 27+ messages in thread
From: Ronald G Minnich @ 2005-10-31 16:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Fco. J. Ballesteros wrote:
> It's easy to write file servers, but that does not mean that it's the
> right way to proceed. IMHO, if you want to see your mail as files, and
> you have a file server, it's easier to store the mail in that format. All
> the code necessary to handle your storage and index structure becomes
> fossil/venti, and all that has to be done is to convert from the mbox format
> into your preferred archival format, and to feed upas with input messages
> for sending. Isn't this more simple and powerful? Or are you thinking of
> something else that is best done using the existing format?


just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
utility (which I went to from mh) and I'll buy it. MH got so painfully 
slow for me that I couldn't take it.

But, hey, implement it and let's see .

no need to argue.

ron



* Re: [9fans] Scaleable mail repositories.
  2005-10-31 16:01 ` Ronald G Minnich
@ 2005-10-31 15:06   ` jmk
  0 siblings, 0 replies; 27+ messages in thread
From: jmk @ 2005-10-31 15:06 UTC (permalink / raw)
  To: 9fans

Spot on, Ron.

And remember, whatever new scheme wins, it still has to be
able to read my 20+ years of mbox format messages too.

--jim

On Mon Oct 31 10:02:13 EST 2005, rminnich@lanl.gov wrote:
> Fco. J. Ballesteros wrote:
> > It's easy to write file servers, but that does not mean that it's the
> > right way to proceed. IMHO, if you want to see your mail as files, and
> > you have a file server, it's easier to store the mail in that format. All
> > the code necessary to handle your storage and index structure becomes
> > fossil/venti, and all that has to be done is to convert from the mbox format
> > into your preferred archival format, and to feed upas with input messages
> > for sending. Isn't this more simple and powerful? Or are you thinking of
> > something else that is best done using the existing format?
> 
> 
> just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
> utility (which I went to from mh) and I'll buy it. MH got so painfully 
> slow for me that I couldn't take it.
> 
> But, hey, implement it and let's see .
> 
> no need to argue.
> 
> ron



* Re: [9fans] rfork(RFPROC) and ffork()
@ 2005-10-30  1:10 geoff
  2005-10-30  1:18 ` Paul Lalonde
  0 siblings, 1 reply; 27+ messages in thread
From: geoff @ 2005-10-30  1:10 UTC (permalink / raw)
  To: quanstro, 9fans

I don't know either.  Creating a new file for each incoming message
seemed like an obvious thing to do in the mid-1980s, though the
concern then might have been i-node consumption on Unixes.

I co-wrote a message store in Inferno while at the labs that decoded
MIME content-transfer-encodings as it read each message off the
network and decoded each part into a separate file in a directory tree
that reflected the hierarchical structure of the MIME message.
upas/fs came later and does the MIME decoding, but not the breaking up
into separate files upon reception.


* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:10 [9fans] rfork(RFPROC) and ffork() geoff
@ 2005-10-30  1:18 ` Paul Lalonde
  2005-10-31  4:06   ` [9fans] Scaleable mail repositories Lyndon Nerenberg
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Lalonde @ 2005-10-30  1:18 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs; +Cc: quanstro

I used to keep all my mail this way using MH; it worked well up to  
the point where directories got so full that directory operations  
were taking too long (I remember a nasty n^2 sort in the early days  
of linux).  If Plan9's file handling is up to it, I'm in favour of  
this approach.  Just remember that it has to work well with 10's of  
thousands of emails in a directory - search is making organization  
obsolete.  Heck, some of us didn't manage organization before search  
made it obsolete.

Paul

On 29-Oct-05, at 6:10 PM, geoff@collyer.net wrote:

> I don't know either.  Creating a new file for each incoming message
> seemed like an obvious thing to do in the mid-1980s, though the
> concern then might have been i-node consumption on Unixes.
>
> I co-wrote a message store in Inferno while at the labs that decoded
> MIME content-transfer-encodings as it read each message off the
> network and decoded each part into a separate file in a directory tree
> that reflected the hierarchical structure of the MIME message.
> upas/fs came later and does the MIME decoding, but not the breaking up
> into separate files upon reception.
>
>


* [9fans] Scaleable mail repositories.
  2005-10-30  1:18 ` Paul Lalonde
@ 2005-10-31  4:06   ` Lyndon Nerenberg
  2005-10-31 10:55     ` C H Forsyth
  0 siblings, 1 reply; 27+ messages in thread
From: Lyndon Nerenberg @ 2005-10-31  4:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Oct 29, 2005, at 6:18 PM, Paul Lalonde wrote:

> I used to keep all my mail this way using MH; it worked well up to  
> the point where directories got so full that directory operations  
> were taking too long (I remember a nasty n^2 sort in the early days  
> of linux).  If Plan9's file handling is up to it, I'm in favour of  
> this approach.  Just remember that it has to work well with 10's of  
> thousands of emails in a directory - search is making organization  
> obsolete.  Heck, some of us didn't manage organization before  
> search made it obsolete.

The trick to making this work well is to do what the Cyrus IMAP  
server does: use the MH style one-message-per-file layout, and keep  
an index cache of the commonly accessed items (from, to, date, mime  
structure).  It's fast, and it scales very well.  When I was at  
Messaging Direct we sold a commercial version of the Cyrus server.   
We later designed our own IMAP server, but we kept the Cyrus file and  
cache layout as it was still the fastest and most scalable solution  
to the problem.

--lyndon
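
A minimal sketch of the layout described above: one message per file, plus a cache of the commonly accessed headers so that a scan-style listing never opens the message files. This is not the Cyrus cache format; the JSON file, the field list, and the names build_cache and scan are assumptions for illustration, and MIME structure is left out.

```python
import email
import json
import os
import tempfile

FIELDS = ("From", "To", "Date", "Subject")

def build_cache(maildir, cache_path):
    """Parse every message file in maildir once and store selected headers."""
    entries = []
    for name in sorted(os.listdir(maildir)):
        path = os.path.join(maildir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            msg = email.message_from_file(f)
        entry = {"file": name}
        for field in FIELDS:
            entry[field] = msg.get(field, "")
        entries.append(entry)
    with open(cache_path, "w") as f:
        json.dump(entries, f)
    return entries

def scan(cache_path):
    """List messages from the cache alone, without opening message files."""
    with open(cache_path) as f:
        return ["%s %s %s" % (e["file"], e["From"], e["Subject"])
                for e in json.load(f)]

# Demonstration: two message files and a cache kept outside the mail directory.
maildir = tempfile.mkdtemp()
for name, sender, subject in [("1", "alice@example.com", "one"),
                              ("2", "bob@example.com", "two")]:
    with open(os.path.join(maildir, name), "w") as f:
        f.write("From: %s\nTo: 9fans\nDate: Mon, 31 Oct 2005 04:06:00 +0000\n"
                "Subject: %s\n\nbody\n" % (sender, subject))

cache = os.path.join(tempfile.mkdtemp(), "cache.json")
build_cache(maildir, cache)
listing = scan(cache)
```

A real implementation would update the cache incrementally on delivery rather than rebuilding it, which is where most of the speed comes from.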


* Re: [9fans] Scaleable mail repositories.
  2005-10-31  4:06   ` [9fans] Scaleable mail repositories Lyndon Nerenberg
@ 2005-10-31 10:55     ` C H Forsyth
  2005-10-31 12:32       ` erik quanstrom
  2005-10-31 15:30       ` jmk
  0 siblings, 2 replies; 27+ messages in thread
From: C H Forsyth @ 2005-10-31 10:55 UTC (permalink / raw)
  To: 9fans

[-- Attachment #1: Type: text/plain, Size: 516 bytes --]

i don't understand why, now that it's easy to write file servers
(compared to unix days), it's necessary to store the mail messages
as actual separate files or directories.  the main problem with
upas/fs i find is that it rewrites the file instead of treating it
as append-only, and it reads the whole thing into memory (in a moderately
bulky format); rather than maintaining a separate index file or files,
and loading as needed.   both the storage and index structure can
then be made suitable for the task.
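
A sketch of the alternative described above: keep the mailbox as a single append-only file, maintain a separate index of byte offsets, and load messages only as needed. It is not how upas/fs works; the function names are hypothetical, the index is held in memory rather than in an index file, and body lines beginning with "From " are assumed to be escaped already, as mbox writers normally do.

```python
import os
import tempfile

def index_mbox(path):
    """Return the byte offset of every line starting with b"From "."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                offsets.append(pos)
            pos += len(line)
    return offsets

def read_message(path, offsets, n):
    """Read message n (0-based) by seeking, without loading the rest."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

def append_message(path, offsets, raw):
    """Append one message and extend the index; nothing is rewritten."""
    offsets.append(os.path.getsize(path))
    with open(path, "ab") as f:
        f.write(raw)

# Demonstration: two messages on disk, a third appended later.
m1 = b"From a@example.com Mon Oct 31 10:55:00 2005\nSubject: one\n\nbody one\n\n"
m2 = b"From b@example.com Mon Oct 31 12:32:00 2005\nSubject: two\n\nbody two\n\n"
m3 = b"From c@example.com Mon Oct 31 15:30:00 2005\nSubject: three\n\nbody three\n\n"

path = os.path.join(tempfile.mkdtemp(), "mbox")
with open(path, "wb") as f:
    f.write(m1 + m2)

offsets = index_mbox(path)
append_message(path, offsets, m3)
second = read_message(path, offsets, 1)
third = read_message(path, offsets, 2)
```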

[-- Attachment #2: Type: message/rfc822, Size: 3962 bytes --]

From: Lyndon Nerenberg <lyndon@orthanc.ca>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: [9fans] Scaleable mail repositories.
Date: Sun, 30 Oct 2005 20:06:37 -0800
Message-ID: <D99CBD3A-2049-4C0C-913F-93151D7D56EE@orthanc.ca>

On Oct 29, 2005, at 6:18 PM, Paul Lalonde wrote:

> I used to keep all my mail this way using MH; it worked well up to  
> the point where directories got so full that directory operations  
> were taking too long (I remember a nasty n^2 sort in the early days  
> of linux).  If Plan9's file handling is up to it, I'm in favour of  
> this approach.  Just remember that it has to work well with 10's of  
> thousands of emails in a directory - search is making organization  
> obsolete.  Heck, some of us didn't manage organization before  
> search made it obsolete.

The trick to making this work well is to do what the Cyrus IMAP  
server does: use the MH style one-message-per-file layout, and keep  
an index cache of the commonly accessed items (from, to, date, mime  
structure).  It's fast, and it scales very well.  When I was at  
Messaging Direct we sold a commercial version of the Cyrus server.   
We later designed our own IMAP server, but we kept the Cyrus file and  
cache layout as it was still the fastest and most scalable solution  
to the problem.

--lyndon


* Re: [9fans] Scaleable mail repositories.
  2005-10-31 10:55     ` C H Forsyth
@ 2005-10-31 12:32       ` erik quanstrom
  2005-11-01 19:56         ` rog
  2005-10-31 15:30       ` jmk
  1 sibling, 1 reply; 27+ messages in thread
From: erik quanstrom @ 2005-10-31 12:32 UTC (permalink / raw)
  To: 9fans, C H Forsyth

on the other hand, what is the downside of keeping one message
per file? the upside is that no indexing is required.

- erik

C H Forsyth <forsyth@vitanuova.com> writes

| 
| i don't understand why, now that it's easy to write file servers
| (compared to unix days), it's necessary to store the mail messages
| as actual separate files or directories.  the main problem with
| upas/fs i find is that it rewrites the file instead of treating it
| as append-only, and it reads the whole thing into memory (in a moderately
| bulky format); rather than maintaining a separate index file or files,
| and loading as needed.   both the storage and index structure can
| then be made suitable for the task.


* Re: [9fans] Scaleable mail repositories.
  2005-10-31 12:32       ` erik quanstrom
@ 2005-11-01 19:56         ` rog
  2005-11-01 22:29           ` Francisco J Ballesteros
  0 siblings, 1 reply; 27+ messages in thread
From: rog @ 2005-11-01 19:56 UTC (permalink / raw)
  To: 9fans

> on the other hand, what is the downside of keeping one message
> per file? the upside is that no indexing is required.

i'd say that an advantage of going for an indexed scheme is that one
could potentially index attributes other than message number.

i've never got around to biting the bullet on this, but i've long
thought that it would be very nice to have a version of upas/fs which
could offer different views onto the same mailbox.  one could
implement a clone-file style filesystem where each line directory
holds some subset of the messages in the overall mailbox, determined
by writing a control message, e.g.  a regexp restriction on a given
header line.  suitable indexing, and a little extra acme support could
make this a smooth experience.
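
A sketch of just the selection step behind such a view: given a header name and a regexp, pick out the matching message numbers. The clone-file interface and the control message are not modelled here, and view is a hypothetical function, not part of upas/fs.

```python
import email
import re

def view(messages, header, pattern):
    """Return the numbers of the messages whose header matches pattern.
    messages maps message number -> raw message text."""
    rx = re.compile(pattern)
    return [n for n, raw in sorted(messages.items())
            if rx.search(email.message_from_string(raw).get(header, ""))]

mailbox = {
    1: "From: rog@example.com\nSubject: [9fans] mail\n\nhi\n",
    2: "From: nemo@example.com\nSubject: lunch\n\nhello\n",
    3: "From: rog@example.com\nSubject: Re: [9fans] mail\n\nyes\n",
}

ninefans = view(mailbox, "Subject", r"\[9fans\]")
from_nemo = view(mailbox, "From", r"^nemo@")
```

Each view is only a list of message numbers, so several views onto the same mailbox cost little beyond the header index needed to compute them quickly.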

i keep many of my old mail messages around, and it's painful to search
through them - i usually end up using grep -n, and plumbing the
mailbox file into acme, which has at least the advantage that it
doesn't use up all my memory.  however it's not a particularly
pleasant experience, and i'd love to see something better.

BTW, one advantage of a file-per-message format is that it enables
straightforward annotation of messages without relying on
mailbox-to-index-file consistency.  i don't know how others use mail,
but i'd find some sort of annotation useful (e.g.  read/unread, intent
to reply), and maybe this is a possible reason for changing the
storage format.  i'm not sure though.  reading many files and
directories will inevitably slow things down (a quick estimate on my
current 23MB mbox shows that it would take just over 4 times as many
9P transactions to read the whole thing if each message were stored as
a separate file).


* Re: [9fans] Scaleable mail repositories.
  2005-11-01 19:56         ` rog
@ 2005-11-01 22:29           ` Francisco J Ballesteros
  2005-11-08 19:56             ` rog
  0 siblings, 1 reply; 27+ messages in thread
From: Francisco J Ballesteros @ 2005-11-01 22:29 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Why search just mail? If you store your mail as files and put in place
a search engine, the views and searches you want to make will work
for it all.

On 11/1/05, rog@vitanuova.com <rog@vitanuova.com> wrote:
> > on the other hand, what is the downside of keeping one message
> > per file? the upside is that no indexing is required.
>
> i'd say that an advantage of going for an indexed scheme is that one
> could potentially index attributes other than message number.
>
> i've never got around to biting the bullet on this, but i've long
> thought that it would be very nice to have a version of upas/fs which
> could offer different views onto the same mailbox.  one could
> implement a clone-file style filesystem where each line directory
> holds some subset of the messages in the overall mailbox, determined
> by writing a control message, e.g.  a regexp restriction on a given
> header line.  suitable indexing, and a little extra acme support could
> make this a smooth experience.
>
> i keep many of my old mail messages around, and it's painful to search
> through them - i usually end up using grep -n, and plumbing the
> mailbox file into acme, which has at least the advantage that it
> doesn't use up all my memory.  however it's not a particularly
> pleasant experience, and i'd love to see something better.
>
> BTW, one advantage of a file-per-message format is that it enables
> straightforward annotation of messages without relying on
> mailbox-to-index-file consistency.  i don't know how others use mail,
> but i'd find some sort of annotation useful (e.g.  read/unread, intent
> to reply), and maybe this is a possible reason for changing the
> storage format.  i'm not sure though.  reading many files and
> directories will inevitably slow things down (a quick estimate on my
> current 23MB mbox shows that it would take just over 4 times as many
> 9P transactions to read the whole thing if each message were stored as
> a separate file).
>


* Re: [9fans] Scaleable mail repositories.
  2005-11-01 22:29           ` Francisco J Ballesteros
@ 2005-11-08 19:56             ` rog
  2005-11-08 23:22               ` Joel Salomon
                                 ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: rog @ 2005-11-08 19:56 UTC (permalink / raw)
  To: 9fans

> Why search just mail? If you store your mail as files and put in place
> a search engine, the views and searches you want to make will work
> for it all.

that would be nice, but i think it's a bit ambitious for what i'm
looking at currently.  the search engine would have to be quite
intelligent:

1) it would have to be triggered on the arrival of new mail (otherwise
newly arrived messages would not be held in the index)
2) it would have to know which parts of the file system contained
mail messages and MIME parse them (assuming the mail files
were stored in raw format, which seems necessary for digital
signature verification, not to mention efficiency of delivery
and storage).

having just had a brief glance at the description of Google Desktop,
it appears that it probably does all these things.  in fact, given the
special parsing necessary to index different kinds of data, it's
probably irrelevant what format the mailbox is in - it's dealable
with.

i have to say that some kind of "google desktop for plan 9" would be
lovely, but going for mail first is perhaps a more immediately
realisable target.

the first step, anyway, in both cases, is writing the code to do the
inverted index.

i thought i'd write an external search algorithm - i'm most of the way
through an extendable hash implementation (which seems simple and
quick for insertion, but things get more complex when dealing with
large values, and on deletion; i'm not sure of the best way to deal
with block allocation; and more seriously, maybe it's essential to
have an algorithm that can do range (e.g.  prefix) lookups).  any
elegant (read *small*!), nicely implemented, open source libraries out
there that might fit the bill?  a good description of an appropriate
algorithm would do just as well...
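
On the range-lookup worry above: a hash table cannot answer prefix queries, but any ordered structure can, since all terms sharing a prefix sit next to each other in sorted order. The sketch below shows this with an in-memory sorted list and binary search. It is an illustration only, not an external (on-disk) algorithm; a B-tree gives the same ordered access on disk. SortedIndex is a hypothetical name.

```python
import bisect

class SortedIndex:
    """Inverted index whose terms are kept sorted to allow prefix lookups."""

    def __init__(self):
        self.terms = []       # sorted list of distinct terms
        self.postings = {}    # term -> sorted list of document ids

    def add(self, term, doc):
        if term not in self.postings:
            bisect.insort(self.terms, term)
            self.postings[term] = []
        docs = self.postings[term]
        i = bisect.bisect_left(docs, doc)
        if i == len(docs) or docs[i] != doc:
            docs.insert(i, doc)

    def prefix(self, p):
        """Return {term: docs} for every term starting with p."""
        lo = bisect.bisect_left(self.terms, p)
        out = {}
        for term in self.terms[lo:]:
            if not term.startswith(p):
                break
            out[term] = self.postings[term]
        return out

idx = SortedIndex()
for term, doc in [("mail", 1), ("mailbox", 2), ("mime", 1),
                  ("mail", 3), ("main", 4), ("mail", 1)]:
    idx.add(term, doc)

mail_hits = idx.prefix("mail")
no_hits = idx.prefix("x")
```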


* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56             ` rog
@ 2005-11-08 23:22               ` Joel Salomon
  2005-11-09  0:51               ` Caerwyn Jones
  2005-11-09  3:32               ` erik quanstrom
  2 siblings, 0 replies; 27+ messages in thread
From: Joel Salomon @ 2005-11-08 23:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 11/8/05, rog@vitanuova.com <rog@vitanuova.com> wrote:
> i have to say that some kind of "google desktop for plan 9" would be
> lovely, but going for mail first is perhaps a more immediately
> realisable target.
>
Gmail for Plan 9, then?

--Joel


* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56             ` rog
  2005-11-08 23:22               ` Joel Salomon
@ 2005-11-09  0:51               ` Caerwyn Jones
  2005-11-09  0:55                 ` Russ Cox
  2005-11-09  3:32               ` erik quanstrom
  2 siblings, 1 reply; 27+ messages in thread
From: Caerwyn Jones @ 2005-11-09  0:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> any elegant (read *small*!), nicely implemented, open source libraries out
> there that might fit the bill?

i've had some success at work using lucene (http://lucene.apache.org).
i'd recommend it. i have thought about implementing it in limbo, but
have yet to get around to it.

on inferno i use an inverted index i wrote based on a btree. it's on
my website, called lexis. i've indexed over 5 years of plain text
emails, one per file, and can usually find anything in seconds. it
supports ranges and is general enough to support file annotations,
categories and binary relations.

-caerwyn


* Re: [9fans] Scaleable mail repositories.
  2005-11-09  0:51               ` Caerwyn Jones
@ 2005-11-09  0:55                 ` Russ Cox
  0 siblings, 0 replies; 27+ messages in thread
From: Russ Cox @ 2005-11-09  0:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> on inferno i use an inverted index i wrote based on a btree. it's on
> my website, called lexis. i've indexed over 5 years of plain text
> emails, one per file, and can usually find anything in seconds. it
> supports ranges and is general enough to support file annotations,
> categories and binary relations.

seconds is too long.  it's really bothering me that gmail takes
seconds to answer my mail searches.  i feel like it used to be
much faster.  if i had a reasonable local interface with a fast
search i'd think about switching back.

russ


* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56             ` rog
  2005-11-08 23:22               ` Joel Salomon
  2005-11-09  0:51               ` Caerwyn Jones
@ 2005-11-09  3:32               ` erik quanstrom
  2 siblings, 0 replies; 27+ messages in thread
From: erik quanstrom @ 2005-11-09  3:32 UTC (permalink / raw)
  To: 9fans, rog

i don't think it would be that bad.

in 1996 in my former life as an IBMer we put 30 million documents 
plus the OpenText web index up on the web. we were using OpenText's
"pat" engine (aka "cpl") and some other full-text search software to run the
queries. if you haven't heard of these, don't feel bad. they're pretty unremarkable
and no longer exist. performance absolutely bit. partially because the interface
between the driver and the engine was xml. (go figure. tim bray was the big
man at opentext.) but the main reason was that the index was set up for
grep-like searches, doing the matching directly against the (patricia-tree'd) text.

i've come to think that's backwards. i think you should scan the corpus for a 
list of unique stemmed terms. (run running Run Run! all considered the same term)
and assign each term an index number. each document can be represented as a 
string of unique index numbers which can be indexed using normal techniques.
a search would first convert the terms to index numbers and then do the search.
regular expressions could be applied to the /term/ list*, not the corpus.
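
a rough sketch of the scheme in the paragraph above, under loud assumptions:
the stemmer is a crude suffix-stripper invented for this example (a real
one would do much better), the names stem and TermIndex are hypothetical,
and the search just scans every document's set of term numbers instead of
indexing them. it only shows terms becoming numbers, documents becoming
sets of numbers, and a regular expression running over the term list
rather than the corpus.

```python
import re


def stem(word):
    """Crude illustrative stemmer (assumption): lowercase, drop punctuation,
    strip one common suffix, undouble a trailing consonant."""
    w = re.sub(r"[^a-z0-9]", "", word.lower())
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if len(w) > 3 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


class TermIndex:
    def __init__(self):
        self.term_id = {}  # stemmed term -> index number
        self.docs = {}     # document name -> set of term index numbers

    def add(self, name, text):
        """Represent a document as the set of its unique term numbers."""
        ids = set()
        for word in text.split():
            t = stem(word)
            if t:
                ids.add(self.term_id.setdefault(t, len(self.term_id)))
        self.docs[name] = ids

    def search(self, *words):
        """Convert words to term numbers first, then match documents."""
        wanted = set()
        for w in words:
            tid = self.term_id.get(stem(w))
            if tid is None:
                return []  # unknown term: no document can match
            wanted.add(tid)
        return sorted(n for n, have in self.docs.items() if wanted <= have)

    def search_re(self, pattern):
        """Apply the regular expression to the term list, not the corpus."""
        rx = re.compile(pattern)
        wanted = {i for t, i in self.term_id.items() if rx.search(t)}
        return sorted(n for n, have in self.docs.items() if wanted & have)
```

with ~33000 terms, `search_re` matches against a short list of words no
matter how big the mail archive gets.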

you could prototype this with 3 tables (term_tab, doc_term_xref, doc_tab) 
from almost any databaseish thing that allows concurrent updates and queries.
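
one way such a prototype might look, using python's sqlite3 module as the
"databaseish thing". the table names come from the paragraph above; the
column names, the helper functions open_index, add_doc and find, and the
choice of sqlite are all assumptions for illustration. stemming is left
out to keep it short.

```python
import sqlite3


def open_index(path=":memory:"):
    """Create the three tables if they do not exist yet."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS term_tab (
            term_id INTEGER PRIMARY KEY,
            term    TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS doc_tab (
            doc_id  INTEGER PRIMARY KEY,
            path    TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS doc_term_xref (
            term_id INTEGER NOT NULL,
            doc_id  INTEGER NOT NULL,
            PRIMARY KEY (term_id, doc_id));
    """)
    return db


def add_doc(db, path, text):
    """Record a document and cross-reference each of its unique terms."""
    db.execute("INSERT OR IGNORE INTO doc_tab (path) VALUES (?)", (path,))
    doc_id = db.execute(
        "SELECT doc_id FROM doc_tab WHERE path = ?", (path,)).fetchone()[0]
    for term in set(text.lower().split()):
        db.execute("INSERT OR IGNORE INTO term_tab (term) VALUES (?)", (term,))
        db.execute(
            "INSERT OR IGNORE INTO doc_term_xref (term_id, doc_id) "
            "SELECT term_id, ? FROM term_tab WHERE term = ?", (doc_id, term))
    db.commit()


def find(db, term):
    """Paths of the documents containing the term, via the xref table."""
    rows = db.execute(
        "SELECT d.path FROM term_tab t "
        "JOIN doc_term_xref x ON x.term_id = t.term_id "
        "JOIN doc_tab d ON d.doc_id = x.doc_id "
        "WHERE t.term = ? ORDER BY d.path", (term.lower(),))
    return [r[0] for r in rows]
```

the primary key on (term_id, doc_id) is what makes the lookup an index
probe rather than a scan of the text.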

obviously there's some generality lost. (proximity searches, whitespace/newline matching.)
but, i think this would get you 80% of what you would want at 20% of the complexity.

so many things to program, so little time.

- erik "mr vaporware" quanstrom

* my quick-and-dirty term check of my own email archive gives ~33000 terms.
(this is a big overcount.)

; cat */* | \
	tr 'A-Z' 'a-z' | \
	sed 's/[][;:"{}~`!@#$%^&*()+=|\\/?<>,.]/ /g' | \
	grep -v '[^a-z0-9]$' | \
	awk '
{
	for (i=1; i<=NF;i++) {
		l = length($i);
		if (l>1 && l<15)
			A[$i]++
	} 
}

END {
	n=0; 
	for(i in A) {
		n++; 
		printf("%d %s\n", A[i], i);
	}
	print n
}' |wc -l
33558

rog@vitanuova.com writes

| i thought i'd write an external search algorithm - i'm most of the way
| through an extendable hash implementation (which seems simple and
| quick for insertion, but things get more complex when dealing with
| large values, and on deletion; i'm not sure of the best way to deal
| with block allocation; and more seriously, maybe it's essential to
| have an algorithm that can do range (e.g.  prefix) lookups).  any
| elegant (read *small*!), nicely implemented, open source libraries out
| there that might fit the bill?  a good description of an appropriate
| algorithm would do just as well...

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 10:55     ` C H Forsyth
  2005-10-31 12:32       ` erik quanstrom
@ 2005-10-31 15:30       ` jmk
  1 sibling, 0 replies; 27+ messages in thread
From: jmk @ 2005-10-31 15:30 UTC (permalink / raw)
  To: 9fans

you spoiled it. i was waiting to see how long it took for the
community to come up with that.

On Mon Oct 31 05:57:02 EST 2005, forsyth@vitanuova.com wrote:

> i don't understand why, now that it's easy to write file servers
> (compared to unix days), it's necessary to store the mail messages
> as actual separate files or directories.  the main problem with
> upas/fs i find is that it rewrites the file instead of treating it
> as append-only, and it reads the whole thing into memory (in a moderately
> bulky format); rather than maintaining a separate index file or files,
> and loading as needed.   both the storage and index structure can
> then be made suitable for the task.
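
a small sketch of the append-only idea in the quoted text, not of upas/fs
or of any real replacement for it. the file layout (one "offset length"
line per message in a separate index file) and the function names are
assumptions made up for this example.

```python
import os


def append_message(mbox_path, index_path, raw):
    """Append one message (bytes) to the mailbox without rewriting it,
    and record where it landed in the separate index file."""
    with open(mbox_path, "ab") as mbox:
        mbox.seek(0, os.SEEK_END)
        offset = mbox.tell()
        mbox.write(raw)
    with open(index_path, "a") as idx:
        idx.write(f"{offset} {len(raw)}\n")


def load_index(index_path):
    """Read only the small index: a list of (offset, length) pairs."""
    with open(index_path) as idx:
        return [tuple(int(f) for f in line.split()) for line in idx]


def read_message(mbox_path, entry):
    """Load a single message on demand instead of the whole mailbox."""
    offset, length = entry
    with open(mbox_path, "rb") as mbox:
        mbox.seek(offset)
        return mbox.read(length)
```

only the index needs to be in memory; message bodies are read as needed.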


end of thread, other threads:[~2005-11-10 11:55 UTC | newest]

Thread overview: 27+ messages
2005-10-31 15:33 [9fans] Scaleable mail repositories Fco. J. Ballesteros
2005-10-31 18:38 ` William Josephson
  -- strict thread matches above, loose matches on Subject: below --
2005-11-09  9:45 Fco. J. Ballesteros
2005-11-09 10:24 ` Charles Forsyth
2005-11-09 14:19   ` Sam
2005-11-10  1:24     ` erik quanstrom
2005-11-10  2:30       ` Russ Cox
2005-11-10  6:33         ` Scott Schwartz
2005-11-10 11:55         ` erik quanstrom
2005-10-31 15:19 Fco. J. Ballesteros
2005-10-31 15:14 Fco. J. Ballesteros
2005-10-31 16:22 ` Ronald G Minnich
2005-10-31 18:37   ` William Josephson
2005-10-31 11:32 Fco. J. Ballesteros
2005-10-31 16:01 ` Ronald G Minnich
2005-10-31 15:06   ` jmk
2005-10-30  1:10 [9fans] rfork(RFPROC) and ffork() geoff
2005-10-30  1:18 ` Paul Lalonde
2005-10-31  4:06   ` [9fans] Scaleable mail repositories Lyndon Nerenberg
2005-10-31 10:55     ` C H Forsyth
2005-10-31 12:32       ` erik quanstrom
2005-11-01 19:56         ` rog
2005-11-01 22:29           ` Francisco J Ballesteros
2005-11-08 19:56             ` rog
2005-11-08 23:22               ` Joel Salomon
2005-11-09  0:51               ` Caerwyn Jones
2005-11-09  0:55                 ` Russ Cox
2005-11-09  3:32               ` erik quanstrom
2005-10-31 15:30       ` jmk
