Performance with large mailboxes and nnsql.el

Gnus development mailing list
 help / color / mirror / Atom feed

* Performance with large mailboxes and nnsql.el
@ 2002-04-27  1:19 Joseph Barillari
  2002-04-27 14:28 ` Kristoffer Gleditsch
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Barillari @ 2002-04-27  1:19 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 328 bytes --]

Out of curiosity, is anyone working on nnsql.el? Or are there any
fragments of a half-finished nnsql.el lying around with which I could
begin work? 

I'm looking for a good home for a ~30000 message mbox, and I assume
that a SQL database would provide the best performance. Is this
mistaken? Would nnml be faster?

--Joe

[-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Performance with large mailboxes and nnsql.el
  2002-04-27  1:19 Performance with large mailboxes and nnsql.el Joseph Barillari
@ 2002-04-27 14:28 ` Kristoffer Gleditsch
  2002-05-04  0:37   ` Joseph Barillari
  0 siblings, 1 reply; 5+ messages in thread
From: Kristoffer Gleditsch @ 2002-04-27 14:28 UTC (permalink / raw)


[ Joseph Barillari ]

> Out of curiosity, is anyone working on nnsql.el? Or are there any
> fragments of a half-finished nnsql.el lying around with which I
> could begin work?

I did some work on such a backend some time ago, using Eric Marsden's
PostgreSQL interface library
(http://www.chez.com/emarsden/downloads/pg.el).  It almost gave me a
working summary buffer, but I never got around to doing the database
insertion stuff, so it has been standing still for some time.  If
anyone wants to take a look at it, it can be checked out of anonymous
CVS following the instructions at http://www.ping.uio.no/anoncvs.shtml
(just use 'nnsql' instead of 'frotz'.)

(Warning: I didn't know a lot of elisp when I started it, so the code
is probably quite, uhm, interesting in places.)

-- 
Kristoffer.





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Performance with large mailboxes and nnsql.el
  2002-04-27 14:28 ` Kristoffer Gleditsch
@ 2002-05-04  0:37   ` Joseph Barillari
  2002-05-13 23:34     ` Kristoffer Gleditsch
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Barillari @ 2002-05-04  0:37 UTC (permalink / raw)
  Cc: ding

[-- Attachment #1: Type: text/plain, Size: 3621 bytes --]

>>>>> "KG" == Kristoffer Gleditsch <toffer@ping.uio.no> writes:

    > [ Joseph Barillari ]
    >> Out of curiosity, is anyone working on nnsql.el? Or are there
    >> any fragments of a half-finished nnsql.el lying around with
    >> which I could begin work?

    > I did some work on such a backend some time ago, using Eric
    > Marsden's PostgreSQL interface library
    > (http://www.chez.com/emarsden/downloads/pg.el).  It almost gave
    > me a working summary buffer, but I never got around to doing the
    > database insertion stuff, so it has been standing still for some
    > time.  If anyone wants to take a look at it, it can be checked
    > out of anonymous CVS following the instructions at
    > http://www.ping.uio.no/anoncvs.shtml (just use 'nnsql' instead
    > of 'frotz'.)

I noticed in your code:

/* We store each line as a separate database entry.  I don't know if
   that is a good idea or not.  Probably not.  */

I'd agree. One cannot perform a regular expression search on the
message body quite so easily that way (Postgres has such a
feature). 

If splitting the bodies into individual lines is a bad idea,
presumably each body should be kept intact in a TEXT field.

What do you think of splitting headers? I assume splitting the
most-often-searched-upon headers (To:, Cc:, Subject:, etc.) into
separate fields would be a good idea -- it would aid searching, make
it easier to write other clients (web-based clients, for instance),
and it would let the underlying database build indexes. 

Most headers (except From:) can and often do consist of a
comma-delimited list of entries. Splitting these up further along the
comma-delimitations, and inserting them into a separate table is
probably a good idea, once again, as it can improve indexing
performance.

What to do with the message body is another matter.

If one stores the headers separately, should one also remove them from
the body text? Leaving them in the body would be a Bad Idea in the
relational-database frame of thinking -- data should never be
duplicated. But stripping them out means that the database will not be
storing exact copies of the messages. Reconstructing the headers in
the exact order in which they appeared may not be possible unless some
information is left in the database.

This would be a non-issue if the message headers were static. But
headers often need to be changed -- the drafts mailbox is a perfect
example. If the headers are left in the body, and Gnus needs to change
them, then the entire message has to be rewritten -- not just the
header in question.

The only way around that re-writing would be to remove the
stored-separately headers and replace them with some sort of tags that
Gnus will replace with the actual headers when the message is fed into
a buffer. This would speed writes, as the body would not need to be
touched, but make reads ugly and replete with string-replacements, to
say nothing of the additional difficulty in writing read-only clients
(like web-based readers.)

What do you think of this pseudo-model, using the
leave-headers-in-body option?

_Messages_           _Headers_  /* but only interesting headers */
 Body TEXT            Header TEXT                                
 Serial INTEGER--|    Contents TEXT                              
 From TEXT       |----Message INTEGER                            
 MsgID TEXT
 Rec'd TIMESTAMP   
 Date TIMESTAMP
 Mailbox TEXT /*all of the nnsql: mailboxes could be 
                   accommodated in a single table*/

--Joe

[-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Performance with large mailboxes and nnsql.el
  2002-05-04  0:37   ` Joseph Barillari
@ 2002-05-13 23:34     ` Kristoffer Gleditsch
  2002-05-27 19:58       ` Joseph Barillari
  0 siblings, 1 reply; 5+ messages in thread
From: Kristoffer Gleditsch @ 2002-05-13 23:34 UTC (permalink / raw)
  Cc: ding

[ Joseph Barillari ]

> If splitting the bodies into individual lines is a bad idea,
> presumably each body should be kept intact in a TEXT field.

I agree. (I started making this database layout while entry sizes in
PSQL were a bit more of an issue than they are today.)

> What do you think of splitting headers?

I'm all for it.  :)

> Most headers (except From:) can and often do consist of a
> comma-delimited list of entries. Splitting these up further along
> the comma-delimitations, and inserting them into a separate table is
> probably a good idea, once again, as it can improve indexing
> performance.

I must admit I haven't thought a lot about that.  But this is a
database, after all, so I think it makes sense to split them up.

> If one stores the headers separately, should one also remove them
> from the body text?

One could just split them from the body, storing the entire header
part of the mail in a separate text record.  That way the exact
original headers would still be available, while being "out of the
way" for searches and stuff.

> Messages           Headers  /* but only interesting headers */
> Body TEXT            Header TEXT
> Serial INTEGER--|    Contents TEXT
> From TEXT       |----Message INTEGER
> MsgID TEXT
> Rec'd TIMESTAMP
> Date TIMESTAMP
> Mailbox TEXT /*all of the nnsql: mailboxes could be
>                   accommodated in a single table*/

I'm not sure about the idea of a generic header table; if we're going
to split lots of headers into separate tuples, I think perhaps one
table per header type will be easier to deal with.

About the mailbox thing: I plan to use the keywords (bad name, I'm
renaming it to groups) table for that; it contains a mapping between
groups/mailboxes and messages.  That way the same message can belong
to several (or no) groups without having to store it twice.

I've thrown together some pseudo-SQL for a new database layout, and
checked it into CVS as 'creation2.sql'.  It would be great if you
could take a look at it and let me know what you think.

(Sorry about the late answer.  I'll be pretty busy for another month
or so, but with some luck I'll get time to look some more at this
during the summer.  If you want to do stuff with this code in the
meantime, I think that's great; don't let me be stop you.  :-)

-- 
Kristoffer.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Performance with large mailboxes and nnsql.el
  2002-05-13 23:34     ` Kristoffer Gleditsch
@ 2002-05-27 19:58       ` Joseph Barillari
  0 siblings, 0 replies; 5+ messages in thread
From: Joseph Barillari @ 2002-05-27 19:58 UTC (permalink / raw)
  Cc: ding

[-- Attachment #1: Type: text/plain, Size: 3658 bytes --]

>>>>> "KG" == Kristoffer Gleditsch <toffer@ping.uio.no> writes:

      <snip>

    >> If one stores the headers separately, should one also remove
    >> them from the body text?

    KG> One could just split them from the body, storing the entire
    KG> header part of the mail in a separate text record.  That way
    KG> the exact original headers would still be available, while
    KG> being "out of the way" for searches and stuff.

That still doesn't solve the update question. What happens if the user
edits the To: header? Gnus would have to rewrite both the text block
in which the raw headers are stored *and* the entry in the table that
stores the To: headers. This violates the "don't store duplicate data"
axiom of database design, so I'm hesitant to accept it. However, I
don't see any other way to preserve both the original headers and gain
the benefits of indexing them, short of the ugly
replace-headers-with-tags scheme I mentioned in my last message.

<snip>

    KG> I'm not sure about the idea of a generic header table; if
    KG> we're going to split lots of headers into separate tuples, I
    KG> think perhaps one table per header type will be easier to deal
    KG> with.

My inclination is to prefer generality, and use a generalized header
table. This has the advantage of being trivially extensible to cover
different headers, and shouldn't impact lookup speed, if the database
is doing a good job of indexing.

However, it would require (as you indicated in creation2.sql) creating
a unique ID for each header, so as to enable unambiguous references
to individual headers. 

If we used your suggestion and split certain headers into their own
tables, we could add a constraint that forces uniqueness on the
message serial number, without bothering to create a separate serial
number for each header. This applies only for headers that appear once
per message, as multiple-instantiation headers like Received: would
need serial numbers if we split them from the message text.

Perhaps we could have two generalized tables: one for headers that
appear once and only once (Subject:) and one for headers that appear
multiple times (Received:) or can contain multiple items (To:).

    KG> About the mailbox thing: I plan to use the keywords (bad name,
    KG> I'm renaming it to groups) table for that; it contains a
    KG> mapping between groups/mailboxes and messages.  That way the
    KG> same message can belong to several (or no) groups without
    KG> having to store it twice.

Good idea.

    KG> I've thrown together some pseudo-SQL for a new database
    KG> layout, and checked it into CVS as 'creation2.sql'.  It would
    KG> be great if you could take a look at it and let me know what
    KG> you think.

Looks fine. Two issues remain. One: what is the best way to store the
raw headers? Two: Should we use one table, two tables*, or many
tables** to store the split-off headers?

    KG> (Sorry about the late answer.  I'll be pretty busy for another
    KG> month or so, but with some luck I'll get time to look some
    KG> more at this during the summer.  If you want to do stuff with
    KG> this code in the meantime, I think that's great; don't let me
    KG> be stop you.  :-)

The delay is no problem; I've been busy as well. (I just read your
message today.)

--Joe

*One for headers that appear only once per message with a uniqueness
 constraint on the message serial number and the header name, and
 another for headers instantiated more than once, each with its own
 serial number.

**One for each header type.

[-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2002-05-27 19:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-04-27  1:19 Performance with large mailboxes and nnsql.el Joseph Barillari
2002-04-27 14:28 ` Kristoffer Gleditsch
2002-05-04  0:37   ` Joseph Barillari
2002-05-13 23:34     ` Kristoffer Gleditsch
2002-05-27 19:58       ` Joseph Barillari

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).