Gnus development mailing list
From: Harry Putnam <reader@newsguy.com>
Subject: OT [Archive techniques] What to do when it gets massive
Date: Wed, 11 Aug 2004 20:34:02 -0500
Message-ID: <m3hdr95ed1.fsf@newsguy.com>

I've been archiving a changing list of nntp and mail messages for a
very long time, partly to have a hefty amount of data to test various
search techniques against.

I've never really hit on a good method for doing this.  I started
with rsync and still use it like this:

  Run rsync against ~/News/agent/nntp using an exclude file that keeps
  out anything but the directories and messages, into a mirror of
  those directories.  The result is that as new messages come in and
  old are expired from ~/News they accumulate on /arch/news.
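
A minimal sketch of that rsync run, written as a Python helper that
only builds the command line.  The function name and the example paths
are illustrative, and the contents of the exclude file are not shown
in this message, so the file is simply passed through.  Leaving out
--delete is what lets expired messages stay in the mirror.

```python
from pathlib import Path


def build_rsync_command(source, dest, exclude_file):
    """Return the argv list for mirroring `source` into `dest`.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    source = Path(source).expanduser()
    dest = Path(dest).expanduser()
    return [
        "rsync",
        "-a",                              # archive mode: recurse, keep times and perms
        f"--exclude-from={exclude_file}",  # patterns that keep unwanted files out
        # No --delete: files expired from the source stay in the mirror.
        f"{source}/",                      # trailing slash: copy the contents of source
        str(dest),
    ]
```

The list can be handed to subprocess.run(cmd, check=True) once the
paths look right.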

At some point the size is so large as to make any commands run against
the massive heap of data take a long time.  I'd like to break this
pile up somehow, but will work on that later.

Right now I'd like to start rsyncing to dated mirrors one month at a
time.  However I see no way to do this without having major overlap.

Example: Agent downloads for a month and I have a large accumulation
under News/agent/nntp.  These have been getting rsynced to this month's
mirror.

Now when I change over to a new month and start feeding a new, empty
mirror, all the messages under News...nntp are copied there unless I
empty out News/agent/nntp.  Even then, without some hand work, the
agent will download whatever is still on the server in the initial
run, and the vast majority of those will be overlaps.

Rsync seems to have no kind of `newer' type thing like find has.
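
One way to get a `newer'-style split without rsync is to sort files
into monthly mirrors by their modification time.  The sketch below is
only an illustration under assumptions: the function names and the
YYYY-MM directory layout are made up here, it copies every regular
file (it does not apply the exclude file), and it assumes a message's
mtime reflects when the agent fetched it.  Messages that get fetched
again in a later month would still land in that later bucket.

```python
import shutil
from datetime import datetime
from pathlib import Path


def month_bucket(path):
    """Return 'YYYY-MM' for the file's modification time (local time)."""
    return datetime.fromtimestamp(path.stat().st_mtime).strftime("%Y-%m")


def archive_by_month(source, archive_root):
    """Copy each file under `source` to archive_root/YYYY-MM/<relative path>.

    Files already present in their bucket are skipped, so repeated runs
    only copy what has arrived since the last run.
    """
    source = Path(source)
    archive_root = Path(archive_root)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        target = archive_root / month_bucket(path) / path.relative_to(source)
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the mtime
        copied.append(target)
    return copied
```

Because each file goes to exactly one bucket, switching months needs
no new empty mirror and no emptying of the source tree.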

I've wondered whether, if I just removed all the numbered files but
left the .agentview files in place, the agent would just continue with
only new messages it hasn't seen.  If that is the case, then that would
be one way to do it.
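
The cleanup step itself could be sketched like this.  It assumes the
article files are the ones whose names are all digits, and it defaults
to a dry run.  The function names are hypothetical, and whether the
agent then fetches only unseen messages is still the open question
above; the sketch does not answer it.

```python
from pathlib import Path


def numbered_files(agent_dir):
    """Return files under `agent_dir` whose names consist only of ASCII digits."""
    return sorted(
        p for p in Path(agent_dir).rglob("*")
        if p.is_file() and p.name.isascii() and p.name.isdigit()
    )


def remove_numbered_files(agent_dir, dry_run=True):
    """List the numbered files, deleting them only when dry_run is False.

    Files such as .agentview are never matched, so they stay in place.
    """
    victims = numbered_files(agent_dir)
    if not dry_run:
        for p in victims:
            p.unlink()
    return victims
```

Running it once with dry_run=True and reading the returned list before
deleting anything seems the safer order.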

That way would leave only one major inconvenience.  I'd have no
backlog of messages in any groups for a while in case I wanted to A T
a thread or do a search or something.

I'm betting some of the seasoned troopers here have some much better
ways of doing this.  Answers of `use google instead' or `use
search.gmane.org instead' are not accepted... hehe.


Thread overview: 8+ messages
2004-08-12  1:34 Harry Putnam [this message]
2004-08-12 13:29 ` Ted Zlatanov
2004-08-13  1:59   ` Harry Putnam
2004-08-16 17:35     ` Ted Zlatanov
2004-08-16 18:02       ` Harry Putnam
2004-09-02 13:07 ` Kai Grossjohann
2004-09-04 19:37   ` Harry Putnam
2004-09-07 11:12 ` Kai Grossjohann
