9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] rfork(RFPROC) and ffork()
@ 2005-10-29 15:34 erik quanstrom
  2005-10-29 19:11 ` William Josephson
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: erik quanstrom @ 2005-10-29 15:34 UTC (permalink / raw)
  To: 9fans

i was trying to get faces running under p9p but startproc()
calls rfork with RFPROC. p9rfork() suggests ffork():

; ./o.faces
faces: fork failed: cannot use rfork for shared memory -- use ffork

but i don't see the function in the source, the includes, or any #define.
there's a file $PLAN9/src/lib9/testfork.c that references ffork()
but it didn't provide any clues

any pointers would be appreciated.

- erik



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 15:34 [9fans] rfork(RFPROC) and ffork() erik quanstrom
@ 2005-10-29 19:11 ` William Josephson
  2005-10-29 19:18 ` Russ Cox
  2005-10-31 14:48 ` Russ Cox
  2 siblings, 0 replies; 55+ messages in thread
From: William Josephson @ 2005-10-29 19:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Oct 29, 2005 at 10:34:46AM -0500, erik quanstrom wrote:
> i was trying to get faces running under p9p but startproc()
> calls rfork with RFPROC. p9rfork() suggests ffork():

Ffork predates the current thread library.
You probably want to switch to using the
thread library instead.



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 15:34 [9fans] rfork(RFPROC) and ffork() erik quanstrom
  2005-10-29 19:11 ` William Josephson
@ 2005-10-29 19:18 ` Russ Cox
  2005-10-29 23:00   ` erik quanstrom
  2005-10-31 14:48 ` Russ Cox
  2 siblings, 1 reply; 55+ messages in thread
From: Russ Cox @ 2005-10-29 19:18 UTC (permalink / raw)
  To: erik quanstrom, Fans of the OS Plan 9 from Bell Labs

Ffork is gone; use libthread instead.  I've updated the message.

As for faces, John Cummings has a working faces.  I'm hoping
he will send it to me so I can include it in the distribution.
He also got upas and acme mail running enough to read
mail (I believe via pop or imap) on his system.  I've checked
in the code to CVS.  It may need some cleaning up (any volunteers?)
and is not built by default, but it's there.  See
src/cmd/upas and src/cmd/acme/mail.

Russ



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 19:18 ` Russ Cox
@ 2005-10-29 23:00   ` erik quanstrom
  2005-10-29 23:24     ` Francisco J Ballesteros
  2005-10-29 23:38     ` Russ Cox
  0 siblings, 2 replies; 55+ messages in thread
From: erik quanstrom @ 2005-10-29 23:00 UTC (permalink / raw)
  To: 9fans, Russ Cox

thanks for the information. i'll take a look at upas.

the reason i was looking into faces is because i just wrote
a (not-so-simple-anymore) maildir mail "reader". it's vaguely like nedmail
(from what i've read) except each command is a separate executable.
e.g.

; thread
mcat $m/1130599988.10533*	# 2005.10.29 10:08   "Russ Cox" <rsc@swtc	[9fans] 386
    mcat $m/1130612580.12213*	# 2005.10.29 14:07   "William Josephson" 
        mcat $m/1130624105.13127*	# 2005.10.29 15:01   jmk@plan9.bell-labs.
            mcat $m/1130624106.13151*	# 2005.10.29 15:06   "Lucio De Re" <lucio

where each line of thread output is meant to be cut-and-pasted.
(there's also fwd, reply, mdesc, attachments, and detach.)
it's a little clunky right now. and it's not a p9p program because
i wanted to use it on a couple of linux machines that don't have p9p
installed and most mail readers require cursor addressing. It could
be, though, if i converted the "long" print formats.

btw, why does upas use the mbox format rather than a directory?
just wondering.




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 23:00   ` erik quanstrom
@ 2005-10-29 23:24     ` Francisco J Ballesteros
  2005-10-29 23:38     ` Russ Cox
  1 sibling, 0 replies; 55+ messages in thread
From: Francisco J Ballesteros @ 2005-10-29 23:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I have also asked myself why mail in Plan 9 was not kept in a directory
hierarchy.

We have a program that converts from mbox to a directory hierarchy,
leaving attachments as files, uncompressed. Another program presents
a mail interface just by printing summary lines and relative paths to mail
bodies and attachments. Yet another program is used for replying.
I have been using this thing for a couple of months now and I'm quite
happy with the approach. I'll soon be cleaning up both programs, along with
others in Plan B, to make them fit back into Plan 9. Right now I'm cleaning
up libraries, and the plan is to clean up programs after that.



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 23:00   ` erik quanstrom
  2005-10-29 23:24     ` Francisco J Ballesteros
@ 2005-10-29 23:38     ` Russ Cox
  2005-10-30  0:19       ` erik quanstrom
  2005-10-30  1:10       ` [9fans] rfork(RFPROC) and ffork() William Josephson
  1 sibling, 2 replies; 55+ messages in thread
From: Russ Cox @ 2005-10-29 23:38 UTC (permalink / raw)
  To: erik quanstrom, Fans of the OS Plan 9 from Bell Labs

> btw, why does upas use the mbox format rather than a directory?
> just wondering.

upas predates mail directories by quite a long time.



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 23:38     ` Russ Cox
@ 2005-10-30  0:19       ` erik quanstrom
  2005-10-30  1:07         ` Russ Cox
  2005-10-30  1:10         ` geoff
  2005-10-30  1:10       ` [9fans] rfork(RFPROC) and ffork() William Josephson
  1 sibling, 2 replies; 55+ messages in thread
From: erik quanstrom @ 2005-10-30  0:19 UTC (permalink / raw)
  To: 9fans, Russ Cox

that's not the reason i was expecting -- my naïve guess was 1 lockfile is better
than n.

i didn't mean The Maildir Format™®©. especially not the version
courier uses with the utf-7 foldername encoding and the hamster droppings
at the end of the filename to indicate status. i meant a directory with
1 email per file. 





* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  0:19       ` erik quanstrom
@ 2005-10-30  1:07         ` Russ Cox
  2005-10-30  1:15           ` Ronald G Minnich
                             ` (2 more replies)
  2005-10-30  1:10         ` geoff
  1 sibling, 3 replies; 55+ messages in thread
From: Russ Cox @ 2005-10-30  1:07 UTC (permalink / raw)
  To: erik quanstrom; +Cc: 9fans

i think that in the mime and huge mail box
world, splitting mail out into one message per file
with attachments as separate files themselves
actually makes a lot of sense.  it's easier on the
dump, it's easier for the programmer, it's easier
on the users, etc.  best of all, mime becomes a wire
format instead of a storage format.  users needn't
ever worry about it.

you could still use one lock file to lock the entire
directory.  or not, since you get atomic create for
free.
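
A minimal sketch of the "atomic create" point, in Python rather than Plan 9 C. Opening with O_CREAT|O_EXCL fails if the name already exists, so concurrent deliverers each claim a distinct file without any lock file. The function name deliver and the numeric naming scheme are assumptions made for illustration, not anything upas does.

```python
import os


def deliver(maildir: str, data: bytes) -> str:
    """Store one message as a new file in maildir without a lock file.

    O_CREAT | O_EXCL makes the create fail if the name already exists,
    so two concurrent deliverers never end up writing to the same file.
    """
    n = len(os.listdir(maildir)) + 1      # first candidate name (a guess)
    while True:
        path = os.path.join(maildir, str(n))
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            n += 1                        # somebody else took it; try the next
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return path
```

A reader may still see a partly written file; writing under a temporary name and renaming it afterwards would close that gap.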

russ



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  0:19       ` erik quanstrom
  2005-10-30  1:07         ` Russ Cox
@ 2005-10-30  1:10         ` geoff
  2005-10-30  1:18           ` Paul Lalonde
  1 sibling, 1 reply; 55+ messages in thread
From: geoff @ 2005-10-30  1:10 UTC (permalink / raw)
  To: quanstro, 9fans

I don't know either.  Creating a new file for each incoming message
seemed like an obvious thing to do in the mid-1980s, though the
concern then might have been i-node consumption on Unixes.

I co-wrote a message store in Inferno while at the labs that decoded
MIME content-transfer-encodings as it read each message off the
network and decoded each part into a separate file in a directory tree
that reflected the hierarchical structure of the MIME message.
upas/fs came later and does the MIME decoding, but not the breaking up
into separate files upon reception.
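
As a rough illustration of the layout described above (not the Inferno message store itself), the following Python sketch writes each MIME part to its own file in a directory tree that mirrors the message structure. The file names "type" and "body" and the function name explode are assumptions.

```python
import os
from email import message_from_bytes
from email.message import Message


def explode(msg: Message, dest: str) -> None:
    """Write msg under dest: one directory per part, one body file per leaf."""
    os.makedirs(dest, exist_ok=True)
    # Keep the MIME type beside the content so the structure stays visible.
    with open(os.path.join(dest, "type"), "w") as f:
        f.write(msg.get_content_type() + "\n")
    if msg.is_multipart():
        # Sub-parts become numbered subdirectories: dest/1, dest/2, ...
        for i, part in enumerate(msg.get_payload(), start=1):
            explode(part, os.path.join(dest, str(i)))
    else:
        # decode=True undoes base64 / quoted-printable transfer encodings.
        body = msg.get_payload(decode=True) or b""
        with open(os.path.join(dest, "body"), "wb") as f:
            f.write(body)
```

Calling explode(message_from_bytes(raw), "/some/dir") on a raw message leaves the decoded attachments readable with ordinary file tools.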




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 23:38     ` Russ Cox
  2005-10-30  0:19       ` erik quanstrom
@ 2005-10-30  1:10       ` William Josephson
  1 sibling, 0 replies; 55+ messages in thread
From: William Josephson @ 2005-10-30  1:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, Oct 29, 2005 at 07:38:46PM -0400, Russ Cox wrote:
> > btw, why does upas use the mbox format rather than a directory?
> > just wondering.
> 
> upas predates mail directories by quite a long time.

And the popularity of MIME which makes using directories
attractive.



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:07         ` Russ Cox
@ 2005-10-30  1:15           ` Ronald G Minnich
  2005-10-30  1:22             ` geoff
  2005-10-30  1:54           ` Dave Eckhardt
  2005-10-30  2:24           ` erik quanstrom
  2 siblings, 1 reply; 55+ messages in thread
From: Ronald G Minnich @ 2005-10-30  1:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs; +Cc: erik quanstrom

Russ Cox wrote:
> i think that in the mime and huge mail box
> world, splitting mail out into one message per file
> with attachments as separate files themselves
> actually makes a lot of sense.  it's easier on the
> dump, it's easier for the programmer, it's easier
> on the users, etc.  best of all, mime becomes a wire
> format instead of a storage format.  users needn't
> ever worry about it.

yeah but we've come full circle to mh. see: 
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/mh/

and it all seems so nice, and it is so elegant in so many ways, and it 
is all so horribly slow and painful to use. I gave up on mh some time 
ago (let's not talk about it, eh?) because, while I liked the idea, it 
just did not work out at all well in practice.

but mh at the start was simple, then got feature creep. Wonder if you 
could keep the same idea, but not make it a dog this time around? Or 
maybe the machines are so fast now that an mh-like system would feel fast?
It might be worth a look!

ron



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:10         ` geoff
@ 2005-10-30  1:18           ` Paul Lalonde
  2005-10-30  6:52             ` Skip Tavakkolian
                               ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Paul Lalonde @ 2005-10-30  1:18 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs; +Cc: quanstro

I used to keep all my mail this way using MH; it worked well up to  
the point where directories got so full that directory operations  
were taking too long (I remember a nasty n^2 sort in the early days  
of linux).  If Plan9's file handling is up to it, I'm in favour of  
this approach.  Just remember that it has to work well with 10's of  
thousands of emails in a directory - search is making organization  
obsolete.  Heck, some of us didn't manage organization before search  
made it obsolete.

Paul





* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:15           ` Ronald G Minnich
@ 2005-10-30  1:22             ` geoff
  2005-10-30  1:58               ` jmk
  0 siblings, 1 reply; 55+ messages in thread
From: geoff @ 2005-10-30  1:22 UTC (permalink / raw)
  To: 9fans

I used MH 4 on Unix for years and don't recall performance being bad
enough to bother me.  On slower machines, it was often faster to read
incoming mail with MH than to start up /bin/mail to read a Giant Unix
Mailbox.




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:07         ` Russ Cox
  2005-10-30  1:15           ` Ronald G Minnich
@ 2005-10-30  1:54           ` Dave Eckhardt
  2005-10-30  2:24           ` erik quanstrom
  2 siblings, 0 replies; 55+ messages in thread
From: Dave Eckhardt @ 2005-10-30  1:54 UTC (permalink / raw)
  To: 9fans

> i think that in the mime and huge mail box
> world, splitting mail out into one message
> per file with attachments as separate files
> themselves actually makes a lot of sense.

I think it would be attractive to retain at least
the ability to reconstruct the message exactly as
it came off the wire, especially as s/mime and
DomainKeys seem to be catching on and it seems like
signature verification would be difficult otherwise.

Dave Eckhardt



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:22             ` geoff
@ 2005-10-30  1:58               ` jmk
  0 siblings, 0 replies; 55+ messages in thread
From: jmk @ 2005-10-30  1:58 UTC (permalink / raw)
  To: 9fans

The biggest problem we have right now is machines running out
of memory because of all the huge mail procs. Recent changes in
the corporate environment mean there are more of them now on
our servers. You can make disc as cheap as you like but it doesn't
help if people pull huge mailboxes into memory. It'd be nice if
there was a solution to that.



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:07         ` Russ Cox
  2005-10-30  1:15           ` Ronald G Minnich
  2005-10-30  1:54           ` Dave Eckhardt
@ 2005-10-30  2:24           ` erik quanstrom
  2005-10-30  2:51             ` geoff
  2 siblings, 1 reply; 55+ messages in thread
From: erik quanstrom @ 2005-10-30  2:24 UTC (permalink / raw)
  To: Russ Cox; +Cc: 9fans

i first killed a tree with the mime spec in 1991. i filed it away and 
read mail with /bin/cat and later sam. but since mime is here to stay
and quoted-printable and base64 (not to mention rfc2047) are really
annoying to read via /bin/cat, i broke down and wrote something.

it would be nice if mime could be just a wire format. but, there's 
a lot of information there that would be hard to put into just files.
just dropping the Content-Transfer-Encoding on the floor might
be a Good Thing. but what about the Content-Type? especially the
charset=. you might not have a suitable charset converter handy when
you get the mail. and some of them are screwed up. if you have the
mime headers you might be able to figure it out. i got "8859-15"
recently. they wanted "ISO-8859-15". do you want to keep that with 
the email?  how do you tell multipart/alt from multipart/mixed 
without some sort of guide file?
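
A small Python sketch of the charset problem just described: if the original label is kept, a reader can repair it later. The repair rule (a bare "8859-N" is read as "iso-8859-N") and the names resolve_charset and decode_part are assumptions for illustration only.

```python
import codecs
import re


def resolve_charset(label: str, default: str = "latin-1") -> str:
    """Map a possibly sloppy MIME charset label to a codec Python knows."""
    name = label.strip().strip('"').lower()
    candidates = [name]
    # Hypothetical repair rule: a bare "8859-N" probably means "iso-8859-N".
    m = re.fullmatch(r"8859[-_](\d+)", name)
    if m:
        candidates.append("iso-8859-" + m.group(1))
    for cand in candidates:
        try:
            return codecs.lookup(cand).name
        except LookupError:
            continue
    return default    # unknown label: fall back, but keep the original header


def decode_part(raw: bytes, label: str) -> str:
    """Decode a message part using the (repaired) charset label."""
    return raw.decode(resolve_charset(label), errors="replace")
```

Because the guess can be wrong, the stored message should keep the label as received so that a better rule can be applied later.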

maybe i'm overreacting to the problem, but i'm afraid that 
"just piling it into the fs" will result in reinventing
mime in a different sort of way.

i am a big fan of how upas mounts the email. maybe an
additional set of tools like, e.g., mcat, mimedesc, xheader
might be just as easy?

maybe the best idea would be to leave the original email, 1 per file
on disk and let the mail program present a directory structure for 
each.

-- erik




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  2:24           ` erik quanstrom
@ 2005-10-30  2:51             ` geoff
  0 siblings, 0 replies; 55+ messages in thread
From: geoff @ 2005-10-30  2:51 UTC (permalink / raw)
  To: quanstro, 9fans

Given how big some of the MIME parts can be, I prefer not to keep
copying them around.  People mail around video clips and presumably
some day will do the same with entire movies.

We treated MIME as a wire protocol and converted the directory tree to
a MIME stream at the last second before transmission and applied a
content-transfer-encoding if it was going out via SMTP (we had an
alternative protocol, RSMTP, that didn't require encodings).




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:18           ` Paul Lalonde
@ 2005-10-30  6:52             ` Skip Tavakkolian
  2005-10-30 10:14             ` Francisco J Ballesteros
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Skip Tavakkolian @ 2005-10-30  6:52 UTC (permalink / raw)
  To: 9fans

> I used to keep all my mail this way using MH; it worked well up to  
> the point where directories got so full that directory operations  
> were taking too long (I remember a nasty n^2 sort in the early days  
> of linux).  If Plan9's file handling is up to it, I'm in favour of  
> this approach.  Just remember that it has to work well with 10's of  
> thousands of emails in a directory - search is making organization  
> obsolete.  Heck, some of us didn't manage organization before search  
> made it obsolete.

for complex queries, grep's speed is very good; i keep lots of messages
around.  regarding searching/organization, if google could integrate
their search engine with a scanner/paper shredder, i would start getting
organized tomorrow.

cpu% time mgrep plalonde MH
I used to keep all my mail this way using MH; it worked well up to  
/mail/fs/mbox/1796
0.12u 1.19s 2.87r 	 mgrep plalonde MH
cpu% cat /bin/mgrep
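#! /bin/rc
# mgrep: search the upas/fs mailbox mounted at /mail/fs/mbox
# usage: mgrep [from] regex

if (test ($#* -lt 1) -o ($#* -gt 2)) {
	echo 'usage: mgrep [from] regex'
	exit
}

if (test ! -d /mail/fs/mbox) {
	echo '/mail/fs/mbox does not exist'
	exit
}

# one argument: grep every message body for the regex
if (~ $#* 1) {
	grep $1 /mail/fs/mbox/*/body
	exit $?
}

# two arguments: find messages whose from file matches $1,
# then print the directory of each one whose body matches $2
for (i in `{grep $1 /mail/fs/mbox/*/from}) {
	d=`{basename -d $i}
	if (grep $2 $d/body) {
		echo $d
	}
}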




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:18           ` Paul Lalonde
  2005-10-30  6:52             ` Skip Tavakkolian
@ 2005-10-30 10:14             ` Francisco J Ballesteros
  2005-10-30 15:17             ` Russ Cox
  2005-10-31  4:06             ` [9fans] Scaleable mail repositories Lyndon Nerenberg
  3 siblings, 0 replies; 55+ messages in thread
From: Francisco J Ballesteros @ 2005-10-30 10:14 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

To avoid huge directories, we "archive" mails. Archiving actually moves
mails, say, from a whole month, into a mbox.Month directory. After months
using this approach, I've seen no problems with slowdowns on big
directories.

The need for OEXCL can be avoided by storing each mail in a directory named
with a random number. In our case, the dir contains a "text" file that can
be opened in acme/omero to read the mail, and one file (or dir, if it's a
mail) per attachment. Only the real system mbox has to be OEXCL, to prevent
races between incoming mail and mail2fs processes.

If there's interest in trying this, I can move the mail2fs cleanup to a higher
place in my todo list.

hth




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30  1:18           ` Paul Lalonde
  2005-10-30  6:52             ` Skip Tavakkolian
  2005-10-30 10:14             ` Francisco J Ballesteros
@ 2005-10-30 15:17             ` Russ Cox
  2005-10-30 23:00               ` Dave Eckhardt
  2005-10-31  4:06             ` [9fans] Scaleable mail repositories Lyndon Nerenberg
  3 siblings, 1 reply; 55+ messages in thread
From: Russ Cox @ 2005-10-30 15:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I used to keep all my mail this way using MH; it worked well up to
> the point where directories got so full that directory operations
> were taking too long (I remember a nasty n^2 sort in the early days
> of linux).  If Plan9's file handling is up to it, I'm in favour of
> this approach.  Just remember that it has to work well with 10's of
> thousands of emails in a directory - search is making organization
> obsolete.  Heck, some of us didn't manage organization before search
> made it obsolete.

Given that the current mbox file format doesn't scale well to tens
of thousands of emails, I'm not worried about the fact that the
file system doesn't either.

Russ



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30 15:17             ` Russ Cox
@ 2005-10-30 23:00               ` Dave Eckhardt
  2005-10-30 23:14                 ` George Michaelson
                                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Dave Eckhardt @ 2005-10-30 23:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Given that the current mbox file format doesn't scale well to
> tens of thousands of emails, I'm not worried about the fact that
> the file system doesn't either.

The standard hack (did MH ever pick this up?) is to split the
directory, e.g., message 32456 is stored in 32/456 or 3/2/456.
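
The split can be sketched in a few lines of Python. The function name shard_path, the three-digit leaf width and the zero padding for small message numbers are assumptions; only the two example paths come from the text above.

```python
def shard_path(msgno: int, width: int = 3, per_digit: bool = False) -> str:
    """Return a relative path for message number msgno.

    The last `width` digits name the file; the leading digits name the
    directory, or one directory per digit when per_digit is true.
    """
    digits = str(msgno).zfill(width + 1)   # always at least one leading digit
    lead, leaf = digits[:-width], digits[-width:]
    dirs = list(lead) if per_digit else [lead]
    return "/".join(dirs + [leaf])
```

With these defaults shard_path(32456) gives "32/456" and shard_path(32456, per_digit=True) gives "3/2/456", so no directory holds more than 1000 messages.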

The huge storage/locality win for separating out large attachments
if you have venti-like storage (or want to cache virus-scan results)
may well justify tearing the file apart on nextpart boundaries--my
only concern is that the process be 100% reversible by upas when you
ask.  Heck, it could be as simple as "each message is 1 directory,
containing chunks stored one per file, named by integers, and the
original message is formed by cat'ing the chunks in numerical order".
That would let the storing agent decide whether the nextpart boundary
foo belongs in the same chunk as the bits or in its own micro-chunk.
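
A minimal Python sketch of that chunk layout, assuming files named 1, 2, 3, ... inside one directory per message (store_chunks and reassemble are illustrative names). The one subtle point is that the chunks must be joined in numerical order, since a lexical sort would put 10 before 2.

```python
import os


def store_chunks(msgdir: str, chunks: list) -> None:
    """Write each bytes chunk to msgdir/1, msgdir/2, ... in order."""
    os.makedirs(msgdir, exist_ok=True)
    for i, chunk in enumerate(chunks, start=1):
        with open(os.path.join(msgdir, str(i)), "wb") as f:
            f.write(chunk)


def reassemble(msgdir: str) -> bytes:
    """Concatenate the chunks in numerical (not lexical) order."""
    names = sorted((n for n in os.listdir(msgdir) if n.isdigit()), key=int)
    out = []
    for name in names:
        with open(os.path.join(msgdir, name), "rb") as f:
            out.append(f.read())
    return b"".join(out)
```

Where the storing agent cuts the message is free to vary; as long as no bytes are dropped, reassemble returns the original exactly.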

Dave Eckhardt

P.S. At present upas does the >From thing, which isn't reversible.
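
To show why the ">From" convention loses information, here is a Python sketch of the traditional mbox quoting rule. Whether upas applies exactly this rule is an assumption; mbox_quote and mbox_unquote are illustrative names.

```python
def mbox_quote(body: str) -> str:
    """Escape body lines that start with 'From ' (the classic mbox rule)."""
    return "".join(
        ">" + line if line.startswith("From ") else line
        for line in body.splitlines(keepends=True)
    )


def mbox_unquote(body: str) -> str:
    """Best-effort inverse: strip one '>' from lines starting with '>From '."""
    return "".join(
        line[1:] if line.startswith(">From ") else line
        for line in body.splitlines(keepends=True)
    )
```

A body line "From here on" and a body line ">From here on" are both stored as ">From here on", so no unquoting rule can restore both originals.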



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30 23:00               ` Dave Eckhardt
@ 2005-10-30 23:14                 ` George Michaelson
  2005-10-31  2:15                 ` erik quanstrom
  2005-10-31  4:20                 ` Lyndon Nerenberg
  2 siblings, 0 replies; 55+ messages in thread
From: George Michaelson @ 2005-10-30 23:14 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sun, 30 Oct 2005 18:00:19 -0500
Dave Eckhardt <davide+p9@cs.cmu.edu> wrote:

> > Given that the current mbox file format doesn't scale well to
> > tens of thousands of emails, I'm not worried about the fact that
> > the file system doesn't either.
> 
> The standard hack (did MH ever pick this up?) is to split the
> directory, e.g., message 32456 is stored in 32/456 or 3/2/456.

nmh didn't pick it up. I wish, oh how I wish it had. 

-George



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30 23:00               ` Dave Eckhardt
  2005-10-30 23:14                 ` George Michaelson
@ 2005-10-31  2:15                 ` erik quanstrom
  2005-10-31  2:33                   ` geoff
  2005-10-31  3:23                   ` Skip Tavakkolian
  2005-10-31  4:20                 ` Lyndon Nerenberg
  2 siblings, 2 replies; 55+ messages in thread
From: erik quanstrom @ 2005-10-31  2:15 UTC (permalink / raw)
  To: 9fans, Dave Eckhardt

at the risk of repeating myself, this is what i was 
trying to get at earlier. i don't think that inventing 
"filesystem mime" for performance reasons makes sense.

what percentage of email even has attachments?

never mind having a strictly reversible function such that
f*(f(email)) → email. in order to preserve the basic difference between
multipart/alternative and multipart/mixed, one would need a 
metadata file containing that information. in some sort of new format.

- erik




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-31  2:15                 ` erik quanstrom
@ 2005-10-31  2:33                   ` geoff
  2005-10-31  3:23                   ` Skip Tavakkolian
  1 sibling, 0 replies; 55+ messages in thread
From: geoff @ 2005-10-31  2:33 UTC (permalink / raw)
  To: 9fans

Yeah, we did that too: there was a disk file per message that
described the mime parts, including mime type, and that information
was available through the message store's file system interface.
There was enough there to reconstruct the original message.  This was
a working system; we worked out the various issues.




* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-31  2:15                 ` erik quanstrom
  2005-10-31  2:33                   ` geoff
@ 2005-10-31  3:23                   ` Skip Tavakkolian
  1 sibling, 0 replies; 55+ messages in thread
From: Skip Tavakkolian @ 2005-10-31  3:23 UTC (permalink / raw)
  To: quanstro, 9fans

> never mind having a strictly reversible function such that
> f*(f(email)) → email. in order to preserve the basic difference between
> multipart/alternative and multipart/mixed, one would need a 
> metadata file containing that information. in some sort of new format.

maybe have an option to retain the */raw file?




* [9fans] Scaleable mail repositories.
  2005-10-30  1:18           ` Paul Lalonde
                               ` (2 preceding siblings ...)
  2005-10-30 15:17             ` Russ Cox
@ 2005-10-31  4:06             ` Lyndon Nerenberg
  2005-10-31 10:55               ` C H Forsyth
  3 siblings, 1 reply; 55+ messages in thread
From: Lyndon Nerenberg @ 2005-10-31  4:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


On Oct 29, 2005, at 6:18 PM, Paul Lalonde wrote:

> I used to keep all my mail this way using MH; it worked well up to  
> the point where directories got so full that directory operations  
> were taking too long (I remember a nasty n^2 sort in the early days  
> of linux).  If Plan9's file handling is up to it, I'm in favour of  
> this approach.  Just remember that it has to work well with 10's of  
> thousands of emails in a directory - search is making organization  
> obsolete.  Heck, some of us didn't manage organization before  
> search made it obsolete.

The trick to making this work well is to do what the Cyrus IMAP  
server does: use the MH style one-message-per-file layout, and keep  
an index cache of the commonly accessed items (from, to, date, mime  
structure).  It's fast, and it scales very well.  When I was at  
Messaging Direct we sold a commercial version of the Cyrus server.   
We later designed our own IMAP server, but we kept the Cyrus file and  
cache layout as it was still the fastest and most scalable solution  
to the problem.
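
The general idea (one message per file plus a cache of commonly read fields) might look like the Python sketch below. It is not the Cyrus cache format; the JSON file, the field list and the name build_index are assumptions, and the index file is expected to live outside the message directory.

```python
import json
import os
from email.parser import BytesHeaderParser

FIELDS = ("From", "To", "Date", "Subject", "Content-Type")


def build_index(maildir: str, index_path: str) -> dict:
    """Scan a one-message-per-file directory; cache a few headers each."""
    parser = BytesHeaderParser()
    index = {}
    for name in sorted(os.listdir(maildir)):
        path = os.path.join(maildir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            hdrs = parser.parse(f)    # headers only; the body is not parsed
        index[name] = {k: str(hdrs.get(k, "")) for k in FIELDS}
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index
```

A mail reader can then list or search thousands of messages from the index alone and open a message file only when its body is wanted. A real cache would be updated as mail arrives rather than rebuilt by a full scan.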

--lyndon



* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-30 23:00               ` Dave Eckhardt
  2005-10-30 23:14                 ` George Michaelson
  2005-10-31  2:15                 ` erik quanstrom
@ 2005-10-31  4:20                 ` Lyndon Nerenberg
  2005-10-31 21:31                   ` Dave Eckhardt
  2 siblings, 1 reply; 55+ messages in thread
From: Lyndon Nerenberg @ 2005-10-31  4:20 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


On Oct 30, 2005, at 3:00 PM, Dave Eckhardt wrote:

> Heck, it could be as simple as "each message is 1 directory,
> containing chunks stored one per file, named by integers, and the
> original message is formed by cat'ing the chunks in numerical order".
> That would let the storing agent decide whether the nextpart boundary
> foo belongs in the same chunk as the bits or in its own micro-chunk.

This turns out to be a pessimization.  At MD I prototyped this in one  
of our servers, and it ended up being noticeably slower than storing  
the entire message in one file and maintaining a cache of the MIME  
structure.

--lyndon


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31  4:06             ` [9fans] Scaleable mail repositories Lyndon Nerenberg
@ 2005-10-31 10:55               ` C H Forsyth
  2005-10-31 12:32                 ` erik quanstrom
  2005-10-31 15:30                 ` jmk
  0 siblings, 2 replies; 55+ messages in thread
From: C H Forsyth @ 2005-10-31 10:55 UTC (permalink / raw)
  To: 9fans

[-- Attachment #1: Type: text/plain, Size: 516 bytes --]

i don't understand why, now that it's easy to write file servers
(compared to unix days), it's necessary to store the mail messages
as actual separate files or directories.  the main problem with
upas/fs i find is that it rewrites the file instead of treating it
as append-only, and it reads the whole thing into memory (in a moderately
bulky format); rather than maintaining a separate index file or files,
and loading as needed.   both the storage and index structure can
then be made suitable for the task.
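
One way to picture the append-only mailbox with a separate index is the sketch below. It is not upas/fs code: the `.idx` suffix, the fixed 16-byte entry format, and the class name are assumptions for illustration. The mailbox file is only ever appended to, and reading message n costs one index lookup plus one seek rather than loading the whole mailbox.

```python
import os
import struct
import tempfile

# One index entry per message: (offset, length). Hypothetical format.
ENTRY = struct.Struct("<QQ")


class AppendOnlyBox:
    def __init__(self, path):
        self.data_path = path
        self.index_path = path + ".idx"
        # Create both files if they do not exist yet.
        for p in (self.data_path, self.index_path):
            open(p, "ab").close()

    def count(self):
        return os.path.getsize(self.index_path) // ENTRY.size

    def append(self, raw):
        """Append one raw message; return its message number."""
        offset = os.path.getsize(self.data_path)
        with open(self.data_path, "ab") as data:
            data.write(raw)
        with open(self.index_path, "ab") as idx:
            idx.write(ENTRY.pack(offset, len(raw)))
        return self.count() - 1

    def read(self, n):
        """Fetch message n by seeking, without reading the rest."""
        with open(self.index_path, "rb") as idx:
            idx.seek(n * ENTRY.size)
            offset, length = ENTRY.unpack(idx.read(ENTRY.size))
        with open(self.data_path, "rb") as data:
            data.seek(offset)
            return data.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        box = AppendOnlyBox(os.path.join(d, "mbox"))
        box.append(b"From: a\n\nfirst\n")
        n = box.append(b"From: b\n\nsecond\n")
        print(box.read(n))
```

The sketch assumes a single writer; locking and deletion markers are left out.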

[-- Attachment #2: Type: message/rfc822, Size: 3962 bytes --]

From: Lyndon Nerenberg <lyndon@orthanc.ca>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: [9fans] Scaleable mail repositories.
Date: Sun, 30 Oct 2005 20:06:37 -0800
Message-ID: <D99CBD3A-2049-4C0C-913F-93151D7D56EE@orthanc.ca>


On Oct 29, 2005, at 6:18 PM, Paul Lalonde wrote:

> I used to keep all my mail this way using MH; it worked well up to  
> the point where directories got so full that directory operations  
> were taking too long (I remember a nasty n^2 sort in the early days  
> of linux).  If Plan9's file handling is up to it, I'm in favour of  
> this approach.  Just remember that it has to work well with 10's of  
> thousands of emails in a directory - search is making organization  
> obsolete.  Heck, some of us didn't manage organization before  
> search made it obsolete.

The trick to making this work well is to do what the Cyrus IMAP  
server does: use the MH style one-message-per-file layout, and keep  
an index cache of the commonly accessed items (from, to, date, mime  
structure).  It's fast, and it scales very well.  When I was at  
Messaging Direct we sold a commercial version of the Cyrus server.   
We later designed our own IMAP server, but we kept the Cyrus file and  
cache layout as it was still the fastest and most scalable solution  
to the problem.

--lyndon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 10:55               ` C H Forsyth
@ 2005-10-31 12:32                 ` erik quanstrom
  2005-11-01 19:56                   ` rog
  2005-10-31 15:30                 ` jmk
  1 sibling, 1 reply; 55+ messages in thread
From: erik quanstrom @ 2005-10-31 12:32 UTC (permalink / raw)
  To: 9fans, C H Forsyth

on the other hand, what is the downside of keeping one message
per file? the upside is that no indexing is required.

- erik

C H Forsyth <forsyth@vitanuova.com> writes

| i don't understand why, now that it's easy to write file servers
| (compared to unix days), it's necessary to store the mail messages
| as actual separate files or directories.  the main problem with
| upas/fs i find is that it rewrites the file instead of treating it
| as append-only, and it reads the whole thing into memory (in a moderately
| bulky format); rather than maintaining a separate index file or files,
| and loading as needed.   both the storage and index structure can
| then be made suitable for the task.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-29 15:34 [9fans] rfork(RFPROC) and ffork() erik quanstrom
  2005-10-29 19:11 ` William Josephson
  2005-10-29 19:18 ` Russ Cox
@ 2005-10-31 14:48 ` Russ Cox
  2 siblings, 0 replies; 55+ messages in thread
From: Russ Cox @ 2005-10-31 14:48 UTC (permalink / raw)
  To: erik quanstrom, Fans of the OS Plan 9 from Bell Labs

> i was trying to get faces running under p9p but startproc()
> calls rfork with RFPROC. p9rfork() suggests ffork():

now in cvs.  thanks again to john cummings.
russ


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 10:55               ` C H Forsyth
  2005-10-31 12:32                 ` erik quanstrom
@ 2005-10-31 15:30                 ` jmk
  1 sibling, 0 replies; 55+ messages in thread
From: jmk @ 2005-10-31 15:30 UTC (permalink / raw)
  To: 9fans

you spoiled it. i was waiting to see how long it took for the
community to come up with that.

On Mon Oct 31 05:57:02 EST 2005, forsyth@vitanuova.com wrote:

> i don't understand why, now that it's easy to write file servers
> (compared to unix days), it's necessary to store the mail messages
> as actual separate files or directories.  the main problem with
> upas/fs i find is that it rewrites the file instead of treating it
> as append-only, and it reads the whole thing into memory (in a moderately
> bulky format); rather than maintaining a separate index file or files,
> and loading as needed.   both the storage and index structure can
> then be made suitable for the task.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] rfork(RFPROC) and ffork()
  2005-10-31  4:20                 ` Lyndon Nerenberg
@ 2005-10-31 21:31                   ` Dave Eckhardt
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Eckhardt @ 2005-10-31 21:31 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>> Heck, it could be as simple as "each message is 1 directory,
>> containing chunks stored one per file, named by integers, and the
>> original message is formed by cat'ing the chunks in numerical
>> order".

> This turns out to be a pessimization.

Of what?

One possible goal (not the only consideration) would be for
the N copies of a monster attachment received by your N
users to occupy space==1 instead of space==N.  Cyrus IMAP
does this in certain circumstances.  Fossil+Venti will do
it automatically, but only (at present) if the attachment
is aligned the same way in each user's mailbox, which it
frequently won't be...or if it's stored out of line in a
separate file, which would align it the same way for all N
users.
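
The alignment point can be illustrated with a toy model of block-level coalescing. This is only a sketch, not Venti itself: it assumes fixed 8192-byte blocks named by their SHA-1 digest, and the header lengths and random attachment are made up. The same attachment stored inline after headers of different lengths shares no blocks, while the same attachment stored in its own file shares all of them.

```python
import hashlib
import random

BLOCK = 8192  # assumed block size for the illustration


def block_scores(data, block=BLOCK):
    """Return the set of SHA-1 digests of the fixed-size blocks of data."""
    return {hashlib.sha1(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}


rng = random.Random(0)
attachment = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK))

# Inline: the same attachment follows headers of different lengths,
# so its bytes fall at different offsets within the blocks.
box_a = b"a" * 100 + attachment
box_b = b"b" * 357 + attachment
shared_inline = len(block_scores(box_a) & block_scores(box_b))

# Out of line: each user has a separate file holding only the attachment.
shared_separate = len(block_scores(attachment) & block_scores(attachment))

print("shared blocks inline:", shared_inline)
print("shared blocks in separate files:", shared_separate)
```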

Maybe it's not worth worrying about.

Dave Eckhardt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 12:32                 ` erik quanstrom
@ 2005-11-01 19:56                   ` rog
  2005-11-01 22:29                     ` Francisco J Ballesteros
  0 siblings, 1 reply; 55+ messages in thread
From: rog @ 2005-11-01 19:56 UTC (permalink / raw)
  To: 9fans

> on the other hand, what is the downside of keeping one message
> per file? the upside is that no indexing is required.

i'd say that an advantage of going for an indexed scheme is that one
could potentially index attributes other than message number.

i've never got around to biting the bullet on this, but i've long
thought that it would be very nice to have a version of upas/fs which
could offer different views onto the same mailbox.  one could
implement a clone-file style filesystem where each line directory
holds some subset of the messages in the overall mailbox, determined
by writing a control message, e.g.  a regexp restriction on a given
header line.  suitable indexing, and a little extra acme support could
make this a smooth experience.
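
The restriction itself is simple to express; the sketch below shows only that part, not a file server. The function name `view` and the idea of passing messages as raw strings are assumptions for illustration. It returns the numbers of the messages whose named header matches a regular expression, which is the subset one such view directory would expose.

```python
import re
from email import message_from_string


def view(messages, header, pattern):
    """Return indexes of messages whose header matches the regexp."""
    rx = re.compile(pattern)
    selected = []
    for i, raw in enumerate(messages):
        values = message_from_string(raw).get_all(header, [])
        if any(rx.search(v) for v in values):
            selected.append(i)
    return selected


if __name__ == "__main__":
    box = [
        "From: rog@example.com\nSubject: index\n\nbody\n",
        "From: erik@example.com\nSubject: terms\n\nbody\n",
    ]
    print(view(box, "From", r"^rog@"))
```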

i keep many of my old mail messages around, and it's painful to search
through them - i usually end up using grep -n, and plumbing the
mailbox file into acme, which has at least the advantage that it
doesn't use up all my memory.  however it's not a particularly
pleasant experience, and i'd love to see something better.

BTW, one advantage of a file-per-message format is that it enables
straightforward annotation of messages without relying on
mailbox-to-index-file consistency.  i don't know how others use mail,
but i'd find some sort of annotation useful (e.g.  read/unread, intent
to reply), and maybe this is a possible reason for changing the
storage format.  i'm not sure though.  reading many files and
directories will inevitably slow things down (a quick estimate on my
current 23MB mbox shows that it would take just over 4 times as many
9P transactions to read the whole thing if each message were stored as
a separate file).



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-01 19:56                   ` rog
@ 2005-11-01 22:29                     ` Francisco J Ballesteros
  2005-11-08 19:56                       ` rog
  0 siblings, 1 reply; 55+ messages in thread
From: Francisco J Ballesteros @ 2005-11-01 22:29 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Why search just mail? If you store your mail as files and put in place
a search engine, the views and searches you want to make will work
for it all.

On 11/1/05, rog@vitanuova.com <rog@vitanuova.com> wrote:
> > on the other hand, what is the downside of keeping one message
> > per file? the upside is that no indexing is required.
>
> i'd say that an advantage of going for an indexed scheme is that one
> could potentially index attributes other than message number.
>
> i've never got around to biting the bullet on this, but i've long
> thought that it would be very nice to have a version of upas/fs which
> could offer different views onto the same mailbox.  one could
> implement a clone-file style filesystem where each line directory
> holds some subset of the messages in the overall mailbox, determined
> by writing a control message, e.g.  a regexp restriction on a given
> header line.  suitable indexing, and a little extra acme support could
> make this a smooth experience.
>
> i keep many of my old mail messages around, and it's painful to search
> through them - i usually end up using grep -n, and plumbing the
> mailbox file into acme, which has at least the advantage that it
> doesn't use up all my memory.  however it's not a particularly
> pleasant experience, and i'd love to see something better.
>
> BTW, one advantage of a file-per-message format is that it enables
> straightforward annotation of messages without relying on
> mailbox-to-index-file consistency.  i don't know how others use mail,
> but i'd find some sort of annotation useful (e.g.  read/unread, intent
> to reply), and maybe this is a possible reason for changing the
> storage format.  i'm not sure though.  reading many files and
> directories will inevitably slow things down (a quick estimate on my
> current 23MB mbox shows that it would take just over 4 times as many
> 9P transactions to read the whole thing if each message were stored as
> a separate file).
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-01 22:29                     ` Francisco J Ballesteros
@ 2005-11-08 19:56                       ` rog
  2005-11-08 23:22                         ` Joel Salomon
                                           ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: rog @ 2005-11-08 19:56 UTC (permalink / raw)
  To: 9fans

> Why search just mail? If you store your mail as files and put in place
> a search engine, the views and searches you want to make will work
> for it all.

that would be nice, but i think it's a bit ambitious for what i'm
looking at currently.  the search engine would have to be quite
intelligent:

1) it would have to be triggered on the arrival of new mail (otherwise
newly arrived messages would not be held in the index)
2) it would have to know which parts of the file system contained
mail messages and MIME parse them (assuming the mail files
were stored in raw format, which seems necessary for digital
signature verification, not to mention efficiency of delivery
and storage).

having just had a brief glance at the description of Google Desktop,
it appears that it probably does all these things.  in fact, given the
special parsing necessary to index different kinds of data, it's
probably irrelevant what format the mailbox is in - it's dealable
with.

i have to say that some kind of "google desktop for plan 9" would be
lovely, but going for mail first is perhaps a more immediately
realisable target.

the first step, anyway, in both cases, is writing the code to do the
inverted index.

i thought i'd write an external search algorithm - i'm most of the way
through an extendable hash implementation (which seems simple and
quick for insertion, but things get more complex when dealing with
large values, and on deletion; i'm not sure of the best way to deal
with block allocation; and more seriously, maybe it's essential to
have an algorithm that can do range (e.g.  prefix) lookups).  any
elegant (read *small*!), nicely implemented, open source libraries out
there that might fit the bill?  a good description of an appropriate
algorithm would do just as well...
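
The range-lookup concern can be made concrete with a small in-memory sketch. A hash table scatters neighbouring keys, so a prefix query has to visit every key; an ordered structure answers it with one search plus a short scan. The sorted list and the function name below are assumptions standing in for an on-disk ordered index such as a B-tree.

```python
import bisect


def prefix_range(sorted_terms, prefix):
    """Return all terms starting with prefix from a sorted list."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = lo
    # Matching terms are contiguous in sorted order.
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]


if __name__ == "__main__":
    terms = sorted(["mail", "mailbox", "main", "map", "venti"])
    print(prefix_range(terms, "mai"))
```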



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56                       ` rog
@ 2005-11-08 23:22                         ` Joel Salomon
  2005-11-09  0:51                         ` Caerwyn Jones
  2005-11-09  3:32                         ` erik quanstrom
  2 siblings, 0 replies; 55+ messages in thread
From: Joel Salomon @ 2005-11-08 23:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 11/8/05, rog@vitanuova.com <rog@vitanuova.com> wrote:
> i have to say that some kind of "google desktop for plan 9" would be
> lovely, but going for mail first is perhaps a more immediately
> realisable target.
>
Gmail for Plan 9, then?

--Joel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56                       ` rog
  2005-11-08 23:22                         ` Joel Salomon
@ 2005-11-09  0:51                         ` Caerwyn Jones
  2005-11-09  0:55                           ` Russ Cox
  2005-11-09  3:32                         ` erik quanstrom
  2 siblings, 1 reply; 55+ messages in thread
From: Caerwyn Jones @ 2005-11-09  0:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> any elegant (read *small*!), nicely implemented, open source libraries out
> there that might fit the bill?

i've had some success at work using lucene (http://lucene.apache.org).
i'd recommend it. i have thought about implementing it in limbo, but
have yet to get around to it.

on inferno i use an inverted index i wrote based on a btree. it's on
my website, called lexis. i've indexed over 5 years of plain text
emails, one per file, and can usually find anything in seconds. it
supports ranges and is general enough to support file annotations,
categories and binary relations.

-caerwyn


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-09  0:51                         ` Caerwyn Jones
@ 2005-11-09  0:55                           ` Russ Cox
  0 siblings, 0 replies; 55+ messages in thread
From: Russ Cox @ 2005-11-09  0:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> on inferno i use an inverted index i wrote based on a btree. it's on
> my website, called lexis. i've indexed over 5 years of plain text
> emails, one per file, and can usually find anything in seconds. it
> supports ranges and is general enough to support file annotations,
> categories and binary relations.

seconds is too long.  it's really bothering me that gmail takes
seconds to answer my mail searches.  i feel like it used to be
much faster.  if i had a reasonable local interface with a fast
search i'd think about switching back.

russ


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-08 19:56                       ` rog
  2005-11-08 23:22                         ` Joel Salomon
  2005-11-09  0:51                         ` Caerwyn Jones
@ 2005-11-09  3:32                         ` erik quanstrom
  2 siblings, 0 replies; 55+ messages in thread
From: erik quanstrom @ 2005-11-09  3:32 UTC (permalink / raw)
  To: 9fans, rog

i don't think it would be that bad.

in 1996 in my former life as an IBMer we put 30 million documents
plus the OpenText web index up on the web. we were using OpenText's
"pat" engine "cpl" (aka "cpl") and some other full-text search software to run the
queries. if you haven't heard of these, don't feel bad. they're pretty unremarkable
and no longer exist. performance absolutely bit. partially because the interface
between the driver and the engine was xml. (go figure. tim bray was the big
man at opentext.) but the main reason was that the index was set up for
grep-like searches, doing the matching directly against the (patricia-tree'd) text.

i've come to think that's backwards. i think you should scan the corpus for a 
list of unique stemmed terms. (run running Run Run! all considered the same term)
and assign each term an index number. each document can be represented as a
string of unique index numbers which can be indexed using normal techniques.
a search would first convert the terms to index numbers and then do the search.
regular expressions could be applied to the /term/ list*, not the corpus.

you could prototype this with 3 tables (term_tab, doc_term_xref, doc_tab) 
from almost any databaseish thing that allows concurrent updates and queries.

obviously there's some generality lost. (proximity searches, whitespace/newline matching.)
but, i think this would get you 80% of what you would want at 20% of the complexity.
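
A minimal prototype of the three-table idea, using SQLite as the "databaseish thing". The column names are assumptions, and `normalize` only lowercases: it is a stand-in for real stemming, so "running" and "run" stay distinct here. The regular expression is matched against the term list, never against the document text.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE term_tab (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE);
CREATE TABLE doc_tab (doc_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE doc_term_xref (term_id INTEGER, doc_id INTEGER,
                            PRIMARY KEY (term_id, doc_id));
"""


def normalize(word):
    # Stand-in for stemming (assumption): case folding only.
    return word.lower()


def add_document(db, path, text):
    """Record a document as its set of unique term numbers."""
    doc_id = db.execute(
        "INSERT INTO doc_tab (path) VALUES (?)", (path,)).lastrowid
    for term in {normalize(w) for w in re.findall(r"[A-Za-z0-9]+", text)}:
        db.execute("INSERT OR IGNORE INTO term_tab (term) VALUES (?)", (term,))
        term_id = db.execute(
            "SELECT term_id FROM term_tab WHERE term = ?", (term,)).fetchone()[0]
        db.execute("INSERT INTO doc_term_xref VALUES (?, ?)", (term_id, doc_id))


def search(db, pattern):
    """Match the regexp against the term list, then look up documents."""
    rx = re.compile(pattern)
    ids = [tid for tid, term in
           db.execute("SELECT term_id, term FROM term_tab")
           if rx.fullmatch(term)]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = db.execute(
        "SELECT DISTINCT path FROM doc_tab JOIN doc_term_xref USING (doc_id) "
        f"WHERE term_id IN ({marks}) ORDER BY path", ids)
    return [r[0] for r in rows]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    add_document(db, "3376/text", "no need to argue.")
    print(search(db, "arg.*"))
```

As noted above, this design gives up proximity and whitespace matching, since the order of terms within a document is not stored.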

so many things to program, so little time.

- erik "mr vaporware" quanstrom

* my quick-and-dirty term check of my own email archive gives ~33000 terms.
(this is a big overcount.)


; cat */* | \
	tr 'A-Z' 'a-z' | \
	sed 's/[][;:"{}~`!@#$%^&*()+=|\\/?<>,.]/ /g' | \
	grep -v '[^a-z0-9]$' | \
	awk '
{
	for (i=1; i<=NF;i++) {
		l = length($i);
		if (l>1 && l<15)
			A[$i]++
	} 
}

END {
	n=0; 
	for(i in A) {
		n++; 
		printf("%d %s\n", A[i], i);
	}
	print n
}' |wc -l
33558

rog@vitanuova.com writes

| i thought i'd write an external search algorithm - i'm most of the way
| through an extendable hash implementation (which seems simple and
| quick for insertion, but things get more complex when dealing with
| large values, and on deletion; i'm not sure of the best way to deal
| with block allocation; and more seriously, maybe it's essential to
| have an algorithm that can do range (e.g.  prefix) lookups).  any
| elegant (read *small*!), nicely implemented, open source libraries out
| there that might fit the bill?  a good description of an appropriate
| algorithm would do just as well...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-10  2:30       ` Russ Cox
  2005-11-10  6:33         ` Scott Schwartz
@ 2005-11-10 11:55         ` erik quanstrom
  1 sibling, 0 replies; 55+ messages in thread
From: erik quanstrom @ 2005-11-10 11:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs, Russ Cox

yes, and they've developed some interesting high-performance algorithms, which i've scanned,
but need to take a good look at.

the computational bio guys love it because they have long strings of base pairs that they want
to index. and suffix arrays are the ticket for that. the reason they love suffix arrays is that
there is no natural "word".

text searching would be the opposite. words are the natural unit (in speech there are no letters)
and words are often repeated.

- erik

Russ Cox <rsc@swtch.com> writes

| 
| > suffix arrays create an index that is bigger than the
| > original data. regardless of the theoretical O(1) mumble,
| > the size of the index is a major drawback.
| 
| That's true, but it depends a lot on the app.
| The computational biology guys seem to love them
| for indexing large amounts of DNA.
| 
| Russ


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-10  2:30       ` Russ Cox
@ 2005-11-10  6:33         ` Scott Schwartz
  2005-11-10 11:55         ` erik quanstrom
  1 sibling, 0 replies; 55+ messages in thread
From: Scott Schwartz @ 2005-11-10  6:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

| That's true, but it depends a lot on the app.
| The computational biology guys seem to love them
| for indexing large amounts of DNA.

Yes, but even there it's fair to say that opinion is mixed.  A lot of
really good bioinformatics code (e.g. blastz, megablast, blat) uses hash
table based methods instead.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-10  1:24     ` erik quanstrom
@ 2005-11-10  2:30       ` Russ Cox
  2005-11-10  6:33         ` Scott Schwartz
  2005-11-10 11:55         ` erik quanstrom
  0 siblings, 2 replies; 55+ messages in thread
From: Russ Cox @ 2005-11-10  2:30 UTC (permalink / raw)
  To: erik quanstrom, Fans of the OS Plan 9 from Bell Labs

> suffix arrays create an index that is bigger than the
> original data. regardless of the theoretical O(1) mumble,
> the size of the index is a major drawback.

That's true, but it depends a lot on the app.
The computational biology guys seem to love them
for indexing large amounts of DNA.

Russ


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-09 14:19   ` Sam
@ 2005-11-10  1:24     ` erik quanstrom
  2005-11-10  2:30       ` Russ Cox
  0 siblings, 1 reply; 55+ messages in thread
From: erik quanstrom @ 2005-11-10  1:24 UTC (permalink / raw)
  To: 9fans, Sam

suffix arrays create an index that is bigger than the 
original data. regardless of the theoretical O(1) mumble,
the size of the index is a major drawback.
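
To make the size argument concrete, here is a naive suffix array sketch. It is for illustration only: the construction is the simple quadratic one, not one of the fast published algorithms. The array holds one integer per character of text, so with 4-byte or 8-byte integers the index is several times larger than the text itself. Lookup here is a binary search, which costs O(m log n) comparisons for a query of length m, rather than the O(m) of a suffix tree.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in sorted order.

    Naive construction: fine for a demonstration, too slow for real data.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, query):
    """Return every position where query occurs in text."""
    m = len(query)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose m-character prefix >= query.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All matching suffixes are adjacent in the array.
    while lo < len(sa) and text.startswith(query, sa[lo]):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(len(sa), find(text, sa, "ana"))
```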

erik

Sam <sah@softcardsystems.com> writes

| 
| In the not-so-distant past I was part of a three man
| effort to write a web site indexer / search engine
| generator.  My job was to take the indexed files / urls
| (they sucked them down with java) and create a suffix
| tree database that could be searched upon via cgi.  I
| don't have any specific numbers, but it was quite fast.
| 
| This was when google was just becoming known and once
| we realized we could point google at a website the
| project was abandoned.
| 
| The whole point of using suffix trees is linear time
| search wrt the size of the search string (note: not
| the size of the searched text).  Seems like it's
| a good candidate for this task.
| 
| Sam


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-09 10:24 ` Charles Forsyth
@ 2005-11-09 14:19   ` Sam
  2005-11-10  1:24     ` erik quanstrom
  0 siblings, 1 reply; 55+ messages in thread
From: Sam @ 2005-11-09 14:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

In the not-so-distant past I was part of a three man
effort to write a web site indexer / search engine
generator.  My job was to take the indexed files / urls
(they sucked them down with java) and create a suffix
tree database that could be searched upon via cgi.  I
don't have any specific numbers, but it was quite fast.

This was when google was just becoming known and once
we realized we could point google at a website the
project was abandoned.

The whole point of using suffix trees is linear time
search wrt the size of the search string (note: not
the size of the searched text).  Seems like it's
a good candidate for this task.

Sam



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-11-09  9:45 Fco. J. Ballesteros
@ 2005-11-09 10:24 ` Charles Forsyth
  2005-11-09 14:19   ` Sam
  0 siblings, 1 reply; 55+ messages in thread
From: Charles Forsyth @ 2005-11-09 10:24 UTC (permalink / raw)
  To: 9fans

> It would have to be triggered on the changing of files in the
> file system. With some help from the fs, this becomes cheap.

i myself don't want the mail system to have to rely on a particular
underlying file system.   you could certainly push an indexing fs in front
of any other, i suppose, but relying on there being a
definite article (`the' file system), as that and similar remarks
imply, doesn't seem right to me in a distributed system where
different filing resources can be bound in on demand.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
@ 2005-11-09  9:45 Fco. J. Ballesteros
  2005-11-09 10:24 ` Charles Forsyth
  0 siblings, 1 reply; 55+ messages in thread
From: Fco. J. Ballesteros @ 2005-11-09  9:45 UTC (permalink / raw)
  To: 9fans

:  that would be nice, but i think it's a bit ambitious for what i'm
:  looking at currently.  the search engine would have to be quite
:  intelligent:
:  
:  1) it would have to be triggered on the arrival of new mail (otherwise
:  newly arrived messages would not be held in the index)

It would have to be triggered on the changing of files in the
file system. With some help from the fs, this becomes cheap.

:  2) it would have to know which parts of the file system contained
:  mail messages and MIME parse them (assuming the mail files
:  were stored in raw format, which seems necessary for digital
:  signature verification, not to mention efficiency of delivery
:  and storage).

Don't agree. I store the messages in cooked format. That makes it easy to
understand mime :-)
If you want the raw message for whatever purposes,
you might also keep that thing. 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 15:33 Fco. J. Ballesteros
@ 2005-10-31 18:38 ` William Josephson
  0 siblings, 0 replies; 55+ messages in thread
From: William Josephson @ 2005-10-31 18:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Oct 31, 2005 at 04:33:29PM +0100, Fco. J. Ballesteros wrote:
> :  5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
> :  be worth it since you get all the tools. It's certainly faster than mh 
> :  ever was.
> 
> Yep. I agree. I think that the problem is that you need something else
> (apart from grep) to search on big file trees. Any search tool that lets
> you lookup file paths by content would work with your mail as well.
> We have a home grown program that does that, but it does not work well
> enough and does not deserve distribution.

I know at least one person who uses glimpse for this.
Of course his glimpse index process tended to knock
over one of the Sparc cycle servers with some regularity.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 16:22 ` Ronald G Minnich
@ 2005-10-31 18:37   ` William Josephson
  0 siblings, 0 replies; 55+ messages in thread
From: William Josephson @ 2005-10-31 18:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Oct 31, 2005 at 09:22:00AM -0700, Ronald G Minnich wrote:
> Fco. J. Ballesteros wrote:
> >It's implemented. I'm using it :-)
> >Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):
> 
> 5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
> be worth it since you get all the tools. It's certainly faster than mh 
> ever was.

More like painfully slow.  From this calendar year alone,
I have over 120,000 messages.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 15:14 Fco. J. Ballesteros
@ 2005-10-31 16:22 ` Ronald G Minnich
  2005-10-31 18:37   ` William Josephson
  0 siblings, 1 reply; 55+ messages in thread
From: Ronald G Minnich @ 2005-10-31 16:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Fco. J. Ballesteros wrote:
> It's implemented. I'm using it :-)
> Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):

5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
be worth it since you get all the tools. It's certainly faster than mh 
ever was.

ron


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
  2005-10-31 11:32 [9fans] Scaleable mail repositories Fco. J. Ballesteros
@ 2005-10-31 16:01 ` Ronald G Minnich
  2005-10-31 15:06   ` jmk
  0 siblings, 1 reply; 55+ messages in thread
From: Ronald G Minnich @ 2005-10-31 16:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Fco. J. Ballesteros wrote:
> It's easy to write file servers, but that does not mean that it's the
> right way to proceed. IMHO, if you want to see your mail as files, and
> you have a file server, it's easier to store the mail in that format. All
> the code necessary to handle your storage and index structure becomes
> fossil/venti, and all that has to be done is to convert from the mbox format
> into your preferred archival format, and to feed upas with input messages
> for sending. Isn't this more simple and powerful? Or are you thinking of
> something else that is best done using the existing format?


just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
utility (which I went to from mh) and I'll buy it. MH got so painfully 
slow for me that I couldn't take it.

But, hey, implement it and let's see.

no need to argue.

ron


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:33 Fco. J. Ballesteros
  2005-10-31 18:38 ` William Josephson
  0 siblings, 1 reply; 55+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:33 UTC (permalink / raw)
  To: 9fans

:  5 seconds per 1000 seems a tad slow to me. But I'm picky. It still might 
:  be worth it since you get all the tools. It's certainly faster than mh 
:  ever was.

Yep. I agree. I think that the problem is that you need something else
(apart from grep) to search on big file trees. Any search tool that lets
you lookup file paths by content would work with your mail as well.
We have a home grown program that does that,
but it does not work well enough and
does not deserve distribution.

I have placed a copy of our local /sys/src/cmd/mail2fs at
/n/sources/contrib/nemo/mail2fs, in case anyone wants to experiment before
we clean it up.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:19 Fco. J. Ballesteros
  0 siblings, 0 replies; 55+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:19 UTC (permalink / raw)
  To: 9fans

:  And remember, whatever new scheme wins, it still has to be
:  able to read my 20+ years of mbox format messages too.

It can ;-), it's just a matter of converting them...

for (m in yourmboxes){
	mail2fs -d $home/mail/$m $m
}

I mean, I still use omail (~ acme's Mail) to read mboxes I did not
convert, and convert some of them to the dir hier. fmt. as I use them.
But IMHO, there is no problem with that.




* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 15:14 Fco. J. Ballesteros
  2005-10-31 16:22 ` Ronald G Minnich
  0 siblings, 1 reply; 55+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 15:14 UTC (permalink / raw)
  To: 9fans

It's implemented. I'm using it :-)
Not a rocket, but this is fast enough I'd say (thanks to fossil/venti):

; cd /usr/nemo/mail/mbox
; ls | wc
   1294    1294    9050	# 1294 mails, one dir per mail
; time grep 'no need to argue' */text 
3376/text:no need to argue.
3377/text:> no need to argue.
0.03u 0.10s 5.07r 	 grep no need to argue 3101/text 3373/text 3376/text ...

:  
:  just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
:  utility (which I went to from mh) and I'll buy it. MH got so painfully 
:  slow for me that I couldn't take it.
:  
:  But, hey, implement it and let's see .




* Re: [9fans] Scaleable mail repositories.
  2005-10-31 16:01 ` Ronald G Minnich
@ 2005-10-31 15:06   ` jmk
  0 siblings, 0 replies; 55+ messages in thread
From: jmk @ 2005-10-31 15:06 UTC (permalink / raw)
  To: 9fans

Spot on, Ron.

And remember, whatever new scheme wins, it still has to be
able to read my 20+ years of mbox format messages too.

--jim

On Mon Oct 31 10:02:13 EST 2005, rminnich@lanl.gov wrote:
> Fco. J. Ballesteros wrote:
> > It's easy to write file servers, but that does not mean that it's the
> > right way to proceed. IMHO, if you want to see your mail as files, and
> > you have a file server, it's easier to store the mail in that format. All
> > the code necessary to handle your storage and index structure becomes
> > fossil/venti, and all that has to be done is to convert from the mbox format
> > into your preferred archival format, and to feed upas with input messages
> > for sending. Isn't this more simple and powerful? Or are you thinking of
> > something else that is best done using the existing format?
> 
> 
> just run mh 'scan' on 1000 files and make it as fast as the old 'msg' 
> utility (which I went to from mh) and I'll buy it. MH got so painfully 
> slow for me that I couldn't take it.
> 
> But, hey, implement it and let's see .
> 
> no need to argue.
> 
> ron



* Re: [9fans] Scaleable mail repositories.
@ 2005-10-31 11:32 Fco. J. Ballesteros
  2005-10-31 16:01 ` Ronald G Minnich
  0 siblings, 1 reply; 55+ messages in thread
From: Fco. J. Ballesteros @ 2005-10-31 11:32 UTC (permalink / raw)
  To: 9fans

It's easy to write file servers, but that does not mean that it's the
right way to proceed. IMHO, if you want to see your mail as files, and
you have a file server, it's easier to store the mail in that format. All
the code necessary to handle your storage and index structure becomes
fossil/venti, and all that has to be done is to convert from the mbox format
into your preferred archival format, and to feed upas with input messages
for sending. Isn't this more simple and powerful? Or are you thinking of
something else that is best done using the existing format?

:  i don't understand why, now that it's easy to write file servers
:  (compared to unix days), it's necessary to store the mail messages
:  as actual separate files or directories.  the main problem with
:  upas/fs i find is that it rewrites the file instead of treating it
:  as append-only, and it reads the whole thing into memory (in a moderately
:  bulky format); rather than maintaining a separate index file or files,
:  and loading as needed.   both the storage and index structure can
:  then be made suitable for the task.
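
For comparison, the alternative in the quoted text (keep the mbox
append-only, maintain a separate index, load messages as needed) might
be sketched as follows. This is not how upas/fs works; scan_offsets and
read_message are hypothetical names, the index is just a list of byte
offsets, and every line starting with "From " is assumed to begin a
message.

```python
def scan_offsets(mbox_path, start=0):
    """Return (offsets, end): the byte offsets of 'From ' separator
    lines found at or after start, and the offset where scanning ended.

    start must fall on a line boundary, such as a previous end value.
    """
    offsets = []
    pos = start
    with open(mbox_path, "rb") as f:
        f.seek(start)
        for line in f:
            if line.startswith(b"From "):
                offsets.append(pos)
            pos += len(line)
    return offsets, pos


def read_message(mbox_path, offsets, end, i):
    """Read only message i from the file, using the offset index."""
    stop = offsets[i + 1] if i + 1 < len(offsets) else end
    with open(mbox_path, "rb") as f:
        f.seek(offsets[i])
        return f.read(stop - offsets[i])
```

When new mail is appended, calling scan_offsets(path, start=end) and
extending the saved list picks up only the new messages, so the file is
never rewritten and never read into memory as a whole.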




end of thread, other threads:[~2005-11-10 11:55 UTC | newest]

Thread overview: 55+ messages
2005-10-29 15:34 [9fans] rfork(RFPROC) and ffork() erik quanstrom
2005-10-29 19:11 ` William Josephson
2005-10-29 19:18 ` Russ Cox
2005-10-29 23:00   ` erik quanstrom
2005-10-29 23:24     ` Francisco J Ballesteros
2005-10-29 23:38     ` Russ Cox
2005-10-30  0:19       ` erik quanstrom
2005-10-30  1:07         ` Russ Cox
2005-10-30  1:15           ` Ronald G Minnich
2005-10-30  1:22             ` geoff
2005-10-30  1:58               ` jmk
2005-10-30  1:54           ` Dave Eckhardt
2005-10-30  2:24           ` erik quanstrom
2005-10-30  2:51             ` geoff
2005-10-30  1:10         ` geoff
2005-10-30  1:18           ` Paul Lalonde
2005-10-30  6:52             ` Skip Tavakkolian
2005-10-30 10:14             ` Francisco J Ballesteros
2005-10-30 15:17             ` Russ Cox
2005-10-30 23:00               ` Dave Eckhardt
2005-10-30 23:14                 ` George Michaelson
2005-10-31  2:15                 ` erik quanstrom
2005-10-31  2:33                   ` geoff
2005-10-31  3:23                   ` Skip Tavakkolian
2005-10-31  4:20                 ` Lyndon Nerenberg
2005-10-31 21:31                   ` Dave Eckhardt
2005-10-31  4:06             ` [9fans] Scaleable mail repositories Lyndon Nerenberg
2005-10-31 10:55               ` C H Forsyth
2005-10-31 12:32                 ` erik quanstrom
2005-11-01 19:56                   ` rog
2005-11-01 22:29                     ` Francisco J Ballesteros
2005-11-08 19:56                       ` rog
2005-11-08 23:22                         ` Joel Salomon
2005-11-09  0:51                         ` Caerwyn Jones
2005-11-09  0:55                           ` Russ Cox
2005-11-09  3:32                         ` erik quanstrom
2005-10-31 15:30                 ` jmk
2005-10-30  1:10       ` [9fans] rfork(RFPROC) and ffork() William Josephson
2005-10-31 14:48 ` Russ Cox
2005-10-31 11:32 [9fans] Scaleable mail repositories Fco. J. Ballesteros
2005-10-31 16:01 ` Ronald G Minnich
2005-10-31 15:06   ` jmk
2005-10-31 15:14 Fco. J. Ballesteros
2005-10-31 16:22 ` Ronald G Minnich
2005-10-31 18:37   ` William Josephson
2005-10-31 15:19 Fco. J. Ballesteros
2005-10-31 15:33 Fco. J. Ballesteros
2005-10-31 18:38 ` William Josephson
2005-11-09  9:45 Fco. J. Ballesteros
2005-11-09 10:24 ` Charles Forsyth
2005-11-09 14:19   ` Sam
2005-11-10  1:24     ` erik quanstrom
2005-11-10  2:30       ` Russ Cox
2005-11-10  6:33         ` Scott Schwartz
2005-11-10 11:55         ` erik quanstrom
