* Performance with large mailboxes and nnsql.el @ 2002-04-27 1:19 Joseph Barillari 2002-04-27 14:28 ` Kristoffer Gleditsch 0 siblings, 1 reply; 5+ messages in thread From: Joseph Barillari @ 2002-04-27 1:19 UTC (permalink / raw) [-- Attachment #1: Type: text/plain, Size: 328 bytes --] Out of curiosity, is anyone working on nnsql.el? Or are there any fragments of a half-finished nnsql.el lying around with which I could begin work? I'm looking for a good home for a ~30000 message mbox, and I assume that a SQL database would provide the best performance. Is this mistaken? Would nnml be faster? --Joe [-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Performance with large mailboxes and nnsql.el 2002-04-27 1:19 Performance with large mailboxes and nnsql.el Joseph Barillari @ 2002-04-27 14:28 ` Kristoffer Gleditsch 2002-05-04 0:37 ` Joseph Barillari 0 siblings, 1 reply; 5+ messages in thread From: Kristoffer Gleditsch @ 2002-04-27 14:28 UTC (permalink / raw) [ Joseph Barillari ] > Out of curiosity, is anyone working on nnsql.el? Or are there any > fragments of a half-finished nnsql.el lying around with which I > could begin work? I did some work on such a backend some time ago, using Eric Marsden's PostgreSQL interface library (http://www.chez.com/emarsden/downloads/pg.el). It almost gave me a working summary buffer, but I never got around to doing the database insertion stuff, so it has been standing still for some time. If anyone wants to take a look at it, it can be checked out of anonymous CVS following the instructions at http://www.ping.uio.no/anoncvs.shtml (just use 'nnsql' instead of 'frotz'.) (Warning: I didn't know a lot of elisp when I started it, so the code is probably quite, uhm, interesting in places.) -- Kristoffer. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Performance with large mailboxes and nnsql.el 2002-04-27 14:28 ` Kristoffer Gleditsch @ 2002-05-04 0:37 ` Joseph Barillari 2002-05-13 23:34 ` Kristoffer Gleditsch 0 siblings, 1 reply; 5+ messages in thread From: Joseph Barillari @ 2002-05-04 0:37 UTC (permalink / raw) Cc: ding [-- Attachment #1: Type: text/plain, Size: 3621 bytes --] >>>>> "KG" == Kristoffer Gleditsch <toffer@ping.uio.no> writes: > [ Joseph Barillari ] >> Out of curiosity, is anyone working on nnsql.el? Or are there >> any fragments of a half-finished nnsql.el lying around with >> which I could begin work? > I did some work on such a backend some time ago, using Eric > Marsden's PostgreSQL interface library > (http://www.chez.com/emarsden/downloads/pg.el). It almost gave > me a working summary buffer, but I never got around to doing the > database insertion stuff, so it has been standing still for some > time. If anyone wants to take a look at it, it can be checked > out of anonymous CVS following the instructions at > http://www.ping.uio.no/anoncvs.shtml (just use 'nnsql' instead > of 'frotz'.) I noticed in your code: /* We store each line as a separate database entry. I don't know if that is a good idea or not. Probably not. */ I'd agree. One cannot perform a regular expression search on the message body quite so easily that way (Postgres has such a feature). If splitting the bodies into individual lines is a bad idea, presumably each body should be kept intact in a TEXT field. What do you think of splitting headers? I assume splitting the most-often-searched-upon headers (To:, Cc:, Subject:, etc.) into separate fields would be a good idea -- it would aid searching, make it easier to write other clients (web-based clients, for instance), and it would let the underlying database build indexes. Most headers (except From:) can and often do consist of a comma-delimited list of entries. Splitting these up further along the comma-delimitations, and inserting them into a separate table is probably a good idea, once again, as it can improve indexing performance. What to do with the message body is another matter. If one stores the headers separately, should one also remove them from the body text? Leaving them in the body would be a Bad Idea in the relational-database frame of thinking -- data should never be duplicated. But stripping them out means that the database will not be storing exact copies of the messages. Reconstructing the headers in the exact order in which they appeared may not be possible unless some information is left in the database. This would be a non-issue if the message headers were static. But headers often need to be changed -- the drafts mailbox is a perfect example. If the headers are left in the body, and Gnus needs to change them, then the entire message has to be rewritten -- not just the header in question. The only way around that re-writing would be to remove the stored-separately headers and replace them with some sort of tags that Gnus will replace with the actual headers when the message is fed into a buffer. This would speed writes, as the body would not need to be touched, but make reads ugly and replete with string-replacements, to say nothing of the additional difficulty in writing read-only clients (like web-based readers.) What do you think of this pseudo-model, using the leave-headers-in-body option? _Messages_ _Headers_ /* but only interesting headers */ Body TEXT Header TEXT Serial INTEGER--| Contents TEXT From TEXT |----Message INTEGER MsgID TEXT Rec'd TIMESTAMP Date TIMESTAMP Mailbox TEXT /*all of the nnsql: mailboxes could be accommodated in a single table*/ --Joe [-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Performance with large mailboxes and nnsql.el 2002-05-04 0:37 ` Joseph Barillari @ 2002-05-13 23:34 ` Kristoffer Gleditsch 2002-05-27 19:58 ` Joseph Barillari 0 siblings, 1 reply; 5+ messages in thread From: Kristoffer Gleditsch @ 2002-05-13 23:34 UTC (permalink / raw) Cc: ding [ Joseph Barillari ] > If splitting the bodies into individual lines is a bad idea, > presumably each body should be kept intact in a TEXT field. I agree. (I started making this database layout while entry sizes in PSQL were a bit more of an issue than they are today.) > What do you think of splitting headers? I'm all for it. :) > Most headers (except From:) can and often do consist of a > comma-delimited list of entries. Splitting these up further along > the comma-delimitations, and inserting them into a separate table is > probably a good idea, once again, as it can improve indexing > performance. I must admit I haven't thought a lot about that. But this is a database, after all, so I think it makes sense to split them up. > If one stores the headers separately, should one also remove them > from the body text? One could just split them from the body, storing the entire header part of the mail in a separate text record. That way the exact original headers would still be available, while being "out of the way" for searches and stuff. > Messages Headers /* but only interesting headers */ > Body TEXT Header TEXT > Serial INTEGER--| Contents TEXT > From TEXT |----Message INTEGER > MsgID TEXT > Rec'd TIMESTAMP > Date TIMESTAMP > Mailbox TEXT /*all of the nnsql: mailboxes could be > accommodated in a single table*/ I'm not sure about the idea of a generic header table; if we're going to split lots of headers into separate tuples, I think perhaps one table per header type will be easier to deal with. About the mailbox thing: I plan to use the keywords (bad name, I'm renaming it to groups) table for that; it contains a mapping between groups/mailboxes and messages. That way the same message can belong to several (or no) groups without having to store it twice. I've thrown together some pseudo-SQL for a new database layout, and checked it into CVS as 'creation2.sql'. It would be great if you could take a look at it and let me know what you think. (Sorry about the late answer. I'll be pretty busy for another month or so, but with some luck I'll get time to look some more at this during the summer. If you want to do stuff with this code in the meantime, I think that's great; don't let me be stop you. :-) -- Kristoffer. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Performance with large mailboxes and nnsql.el 2002-05-13 23:34 ` Kristoffer Gleditsch @ 2002-05-27 19:58 ` Joseph Barillari 0 siblings, 0 replies; 5+ messages in thread From: Joseph Barillari @ 2002-05-27 19:58 UTC (permalink / raw) Cc: ding [-- Attachment #1: Type: text/plain, Size: 3658 bytes --] >>>>> "KG" == Kristoffer Gleditsch <toffer@ping.uio.no> writes: <snip> >> If one stores the headers separately, should one also remove >> them from the body text? KG> One could just split them from the body, storing the entire KG> header part of the mail in a separate text record. That way KG> the exact original headers would still be available, while KG> being "out of the way" for searches and stuff. That still doesn't solve the update question. What happens if the user edits the To: header? Gnus would have to rewrite both the text block in which the raw headers are stored *and* the entry in the table that stores the To: headers. This violates the "don't store duplicate data" axiom of database design, so I'm hesitant to accept it. However, I don't see any other way to preserve both the original headers and gain the benefits of indexing them, short of the ugly replace-headers-with-tags scheme I mentioned in my last message. <snip> KG> I'm not sure about the idea of a generic header table; if KG> we're going to split lots of headers into separate tuples, I KG> think perhaps one table per header type will be easier to deal KG> with. My inclination is to prefer generality, and use a generalized header table. This has the advantage of being trivially extensible to cover different headers, and shouldn't impact lookup speed, if the database is doing a good job of indexing. However, it would require (as you indicated in creation2.sql) creating a unique ID for each header, so as to enable unambiguous references to individual headers. If we used your suggestion and split certain headers into their own tables, we could add a constraint that forces uniqueness on the message serial number, without bothering to create a separate serial number for each header. This applies only for headers that appear once per message, as multiple-instantiation headers like Received: would need serial numbers if we split them from the message text. Perhaps we could have two generalized tables: one for headers that appear once and only once (Subject:) and one for headers that appear multiple times (Received:) or can contain multiple items (To:). KG> About the mailbox thing: I plan to use the keywords (bad name, KG> I'm renaming it to groups) table for that; it contains a KG> mapping between groups/mailboxes and messages. That way the KG> same message can belong to several (or no) groups without KG> having to store it twice. Good idea. KG> I've thrown together some pseudo-SQL for a new database KG> layout, and checked it into CVS as 'creation2.sql'. It would KG> be great if you could take a look at it and let me know what KG> you think. Looks fine. Two issues remain. One: what is the best way to store the raw headers? Two: Should we use one table, two tables*, or many tables** to store the split-off headers? KG> (Sorry about the late answer. I'll be pretty busy for another KG> month or so, but with some luck I'll get time to look some KG> more at this during the summer. If you want to do stuff with KG> this code in the meantime, I think that's great; don't let me KG> be stop you. :-) The delay is no problem; I've been busy as well. (I just read your message today.) --Joe *One for headers that appear only once per message with a uniqueness constraint on the message serial number and the header name, and another for headers instantiated more than once, each with its own serial number. **One for each header type. [-- Attachment #2: Type: application/pgp-signature, Size: 268 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2002-05-27 19:58 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-04-27 1:19 Performance with large mailboxes and nnsql.el Joseph Barillari 2002-04-27 14:28 ` Kristoffer Gleditsch 2002-05-04 0:37 ` Joseph Barillari 2002-05-13 23:34 ` Kristoffer Gleditsch 2002-05-27 19:58 ` Joseph Barillari
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).