Gnus development mailing list
* OT [Archive techniques] What to do when it gets massive
@ 2004-08-12  1:34 Harry Putnam
  2004-08-12 13:29 ` Ted Zlatanov
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Harry Putnam @ 2004-08-12  1:34 UTC (permalink / raw)


I've been archiving a changing list of nntp and mail messages for a
very long time.  Partly this is to have a hefty pile of data to test
various search techniques against.

I've never really hit on a good method for doing this.  I started
with rsync and still use it like this:

  Run rsync against ~/News/agent/nntp, using an exclude file that
  keeps out anything but the directories and messages, into a mirror
  of those directories.  The result is that as new messages come in
  and old ones are expired from ~/News, they all accumulate under
  /arch/news.
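
Each run is basically this (the exclude-file name here is made up;
the key point is that there is no --delete, so messages expired from
~/News stay put in the mirror):

    rsync -a --exclude-from=$HOME/.news-excludes \
        ~/News/agent/nntp/ /arch/news/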

At some point the archive grows so large that any command run against
the massive heap of data takes a long time.  I'd like to break this
pile up somehow, but I'll work on that later.

Right now I'd like to start rsyncing to dated mirrors one month at a
time.  However, I see no way to do this without major overlap.

Example: the agent downloads for a month and I have a large
accumulation under News/agent/nntp.  These have been getting rsynced
to this month's mirror.

Now when I change over to a new month and start feeding a new, empty
mirror, all the messages under News...nntp are copied there unless I
empty out News/agent/nntp first.  But even then, without some hand
work of some kind, the agent will re-download whatever is still on
the server in the initial run, and many of those, actually the vast
majority, will be overlaps.

Rsync seems to have no kind of `newer' option like find has.

I've wondered whether, if I just removed all the numbered files but
left the .agentview files in place, the agent would continue with
only new messages it hasn't seen.  If that is the case, then that
would be one way to do it.

That way would leave only one major inconvenience: I'd have no
backlog of messages in any groups for a while, in case I wanted to
`A T' a thread or do a search or something.

I'm betting some of the seasoned troopers here have much better ways
of doing this.  Answers of `use Google instead' or `use
search.gmane.org instead' are not accepted... hehe.





* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-12  1:34 OT [Archive techniques] What to do when it gets massive Harry Putnam
@ 2004-08-12 13:29 ` Ted Zlatanov
  2004-08-13  1:59   ` Harry Putnam
  2004-09-02 13:07 ` Kai Grossjohann
  2004-09-07 11:12 ` Kai Grossjohann
  2 siblings, 1 reply; 8+ messages in thread
From: Ted Zlatanov @ 2004-08-12 13:29 UTC (permalink / raw)
  Cc: ding

On Wed, 11 Aug 2004, reader@newsguy.com wrote:

> Right now I'd like to start rsyncing to dated mirrors one month at a
> time.  However, I see no way to do this without major overlap.

> I've wondered whether, if I just removed all the numbered files but
> left the .agentview files in place, the agent would continue with
> only new messages it hasn't seen.  If that is the case, then that
> would be one way to do it.

Look at rsnapshot; it automates this by making a hard-link copy and
then rsyncing over the copy.  That way you only replace the files that
have changed, but your disk usage is not significantly increased.

You can do this manually on the command line, but rsnapshot automates
it.
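
A minimal sketch of the manual version (untested; the dates and paths
are just examples):

    # cheap snapshot: every file is a hard link, not a second copy
    cp -al /arch/news/current /arch/news/2004-08

    # bring `current' up to date; rsync replaces changed files with
    # new ones, so the snapshot keeps the old versions
    rsync -a --delete ~/News/agent/nntp/ /arch/news/current/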

Ted




* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-12 13:29 ` Ted Zlatanov
@ 2004-08-13  1:59   ` Harry Putnam
  2004-08-16 17:35     ` Ted Zlatanov
  0 siblings, 1 reply; 8+ messages in thread
From: Harry Putnam @ 2004-08-13  1:59 UTC (permalink / raw)


"Ted Zlatanov" <tzz@lifelogs.com> writes:

>> I've wondered whether, if I just removed all the numbered files but
>> left the .agentview files in place, the agent would continue with
>> only new messages it hasn't seen.  If that is the case, then that
>> would be one way to do it.
>
> Look at rsnapshot; it automates this by making a hard-link copy and
> then rsyncing over the copy.  That way you only replace the files
> that have changed, but your disk usage is not significantly
> increased.
>
> You can do this manually on the command line, but rsnapshot
> automates it.

I'm probably overlooking something fundamental, but I don't see how
this is really any different from normal rsync, except for the
disk-space savings due to hardlinks.

I didn't see a way to avoid major overlap between monthly archives.
That is, say on 8/30/04 my rsnapshot setup starts writing to a new
archive.  It's still based on the files under ~/News/agent/nntp,
right?  So whatever is in there will be copied over to the new
archive, including many if not all of the messages that were there
the previous month.

Even assuming a full gnus-agent-expire-all, short of actually
deleting all messages under ~/News/agent/nntp and preventing the
agent from re-downloading any, there would be the same overlap
problem... it seems.

What seems to be missing in both rsync and rsnapshot is a way to
compare the files to be updated against a second collection (in this
case the previous month's archive), so that only files that ARE under
~/News/agent/nntp and are NOT in newsarch_072004 get copied to
newsarch_082004.

That would leave one month in which the agent could expire its pile
of files down to only what came in that month.
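
Hmm, though rsync's --compare-dest option may actually be exactly
this, if I'm reading the man page right.  Something like (untested):

    rsync -a --compare-dest=/arch/newsarch_072004 \
        ~/News/agent/nntp/ /arch/newsarch_082004/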

So the only thing that wouldn't be readily scriptable would be the
gnus-agent-expire part.

Does any of this make sense or am I really missing the boat somewhere?

Another way to go at this might be to generate an `exclude-from' list
from the files rsync reports as `uptodate' and those being moved in a
given run.

That compiled list would automatically be added (by user scripting)
to any static exclude list, and would become the `exclude-from' list
for the next run.

Seems like that might work.  Then, when the destination target is
suddenly changed to an empty directory at the beginning of a month,
rsync would have the last list to exclude by, and would not rely only
on what it does (or does not) find in the new directory.
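
Or, simpler than mining rsync's output, the list could be built
straight from last month's archive.  An untested sketch (the leading
`/' anchors each pattern at the transfer root):

    ( cd /arch/newsarch_072004 && find . -type f | sed 's|^\./|/|' ) \
        > /arch/exclude_072004.txt

    rsync -a --exclude-from=/arch/exclude_072004.txt \
        ~/News/agent/nntp/ /arch/newsarch_082004/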

The more I discuss this here, the more I'm beginning to think I may
have hit on something.

Having such lengthy exclude lists may really increase processing
time, though, or may even make the whole thing unusable.





* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-13  1:59   ` Harry Putnam
@ 2004-08-16 17:35     ` Ted Zlatanov
  2004-08-16 18:02       ` Harry Putnam
  0 siblings, 1 reply; 8+ messages in thread
From: Ted Zlatanov @ 2004-08-16 17:35 UTC (permalink / raw)
  Cc: ding

On Thu, 12 Aug 2004, reader@newsguy.com wrote:

> I'm probably overlooking something fundamental, but I don't see how
> this is really any different from normal rsync, except for the
> disk-space savings due to hardlinks.
> 
> I didn't see a way to avoid major overlap between monthly archives.
> That is, say on 8/30/04 my rsnapshot setup starts writing to a new
> archive.  It's still based on the files under ~/News/agent/nntp,
> right?  So whatever is in there will be copied over to the new
> archive, including many if not all of the messages that were there
> the previous month.

Only files that have changed between dates A and B will be copied,
yet you will have two directories: one with the state of things from
date A, the other with the state of things from date B.  Files no
longer in existence at date B will be deleted from the B directory,
but since they are hardlinked they will still exist in the A
directory.
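
If you want the same mechanism with plain rsync, recent versions also
have a --link-dest option that does the hard-link copy in one step.
Roughly (directory names made up):

    rsync -a --delete --link-dest=/arch/snap-2004-07 \
        ~/News/agent/nntp/ /arch/snap-2004-08/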

I think this is exactly what you're looking for, based on your letter,
but I'm probably missing something too :)

Ted




* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-16 17:35     ` Ted Zlatanov
@ 2004-08-16 18:02       ` Harry Putnam
  0 siblings, 0 replies; 8+ messages in thread
From: Harry Putnam @ 2004-08-16 18:02 UTC (permalink / raw)


"Ted Zlatanov" <tzz@lifelogs.com> writes:

> I think this is exactly what you're looking for, based on your letter,
> but I'm probably missing something too :)

I think you are right... thanks




* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-12  1:34 OT [Archive techniques] What to do when it gets massive Harry Putnam
  2004-08-12 13:29 ` Ted Zlatanov
@ 2004-09-02 13:07 ` Kai Grossjohann
  2004-09-04 19:37   ` Harry Putnam
  2004-09-07 11:12 ` Kai Grossjohann
  2 siblings, 1 reply; 8+ messages in thread
From: Kai Grossjohann @ 2004-09-02 13:07 UTC (permalink / raw)


Harry Putnam <reader@newsguy.com> writes:

> At some point the archive grows so large that any command run
> against the massive heap of data takes a long time.  I'd like to
> break this pile up somehow, but I'll work on that later.

Does it help to index this stuff with a search engine?  Namazu seems
to be good.

I have only a small dataset, but I really like it.  I also started
using Namazu on my nnml folders, via gnus-namazu.el, and it seems to
be nice.
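
From memory (check mknmz(1)), building and querying an index is
something like:

    mknmz -O ~/namazu-index /arch/news          # build/update the index
    namazu 'some search terms' ~/namazu-index   # query it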

Kai




* Re: OT [Archive techniques] What to do when it gets massive
  2004-09-02 13:07 ` Kai Grossjohann
@ 2004-09-04 19:37   ` Harry Putnam
  0 siblings, 0 replies; 8+ messages in thread
From: Harry Putnam @ 2004-09-04 19:37 UTC (permalink / raw)


Kai Grossjohann <kai@emptydomain.de> writes:

> Harry Putnam <reader@newsguy.com> writes:
>
>> At some point the archive grows so large that any command run
>> against the massive heap of data takes a long time.  I'd like to
>> break this pile up somehow, but I'll work on that later.
>
> Does it help to index this stuff with a search engine?  Namazu seems
> to be good.
>
> I have only a small dataset, but I really like it.  I also started
> using Namazu on my nnml folders, via gnus-namazu.el, and it seems to
> be nice.

Of course any indexing would have to be faster for searching.  But
what I'm after is a way to break up the heap of data by months.  As I
stated, that is something that may take some thought and time, and it
was not really the subject here.

Ted has offered an application called rsnapshot that looks as if it
might be just the ticket.
Thanks for your reply.





* Re: OT [Archive techniques] What to do when it gets massive
  2004-08-12  1:34 OT [Archive techniques] What to do when it gets massive Harry Putnam
  2004-08-12 13:29 ` Ted Zlatanov
  2004-09-02 13:07 ` Kai Grossjohann
@ 2004-09-07 11:12 ` Kai Grossjohann
  2 siblings, 0 replies; 8+ messages in thread
From: Kai Grossjohann @ 2004-09-07 11:12 UTC (permalink / raw)


Harry Putnam <reader@newsguy.com> writes:

> I've been archiving a changing list of nntp and mail messages for a
> very long time.

To split mail archives by month, perhaps the easiest method is to let
the messages expire, and to configure expiry so that, instead of
deleting each message, it moves the message to an archive group.

Then you configure Gnus so that the archive location is a different
one every month, and you're all set.
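
For instance (an untested sketch; the group naming is just an
example), nnmail-expiry-target can be a function that returns a dated
group:

    ;; move expired mail to a per-month archive group
    (setq nnmail-expiry-target
          (lambda (group)
            (concat "nnml:archive." (format-time-string "%Y-%m"))))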

But for nntp this seems to be more difficult.  Hm.  I'm not sure
what's a good method for this.  Perhaps extending gnus-agent-expire to
do something similar to the expiry-target feature for mail?

Kai



