Gnus development mailing list
* Diffinitive archiving method sought - Big prize money for best entrant
@ 2002-09-05  2:17 Harry Putnam
  2002-09-06 15:50 ` Kai Großjohann
  0 siblings, 1 reply; 4+ messages in thread
From: Harry Putnam @ 2002-09-05  2:17 UTC (permalink / raw)


Ok, so the big prize money was a lie....

But I know there are some card-carrying archivists here:

I've long wanted a smooth-working archiving technique, but I can't
really come up with something that isn't either horribly complex or
allows some unwanted overlap.

I want to keep my current method, which is rsync, but add another
technique that separates my archives into chronological chunks and
allows no overlap.

Let me explain:
I currently download a number of groups with the agent and
periodically run the agent expiry routine.  Over time the selected
groups to download may change, and they have quite a few times over
the years.  So I don't really want something like the `expiry to
target' stuff available for mail.  Too much futzing around as the
selected groups change.

I use rsync to grab new stuff from the agentized messages, adding
it to an archive via a daily cron job.  Anyone who knows rsync will
know that it will keep adding any new messages to an archive.  Ok,
fine so far.  But given enough time, that archive will grow unusably
large, or at least large enough to be a pain to search, etc.

I want to break it up into chunks of some kind.  I think calendar
quarters would be good.  I don't mean here that the messages in the
archive have to fall inside a certain quarter, but only that the
quarter not hold any dups from the one before or after.  So I'm not
concerned about message dates, although they would, by and large, fall
in place.  Just a user-imposed quarter of collected messages.

I can't think of a way to do this directly with rsync, like renaming
an accumulated archive at a certain point and beginning a new one.
Rsync lacks anything like the `-newer' test in the `find' command, so
that kind of approach is out too.

Two things that would cause overlap come up.  The way rsync works, it
would start grabbing messages from ~/News/agent that had already been
archived, since they wouldn't be in the new archive.

The only way I can see to prevent that would be to reduce the
agentized stock to zero at the same time.

If one did that, then the agent itself would redownload some
already-archived stuff (I think, but I haven't tested to make sure).

Even if the agent knows not to redownload, it would be inconvenient to
reduce the agentized messages to zero, leaving no backlog to work with
in gnus.

Scripting to separate the messages by date is not too big a deal, but
I'm talking about 500,000+ messages having to be visited individually
and sorted according to date, while generating a directory structure
to house them.

That may be the only way.  I wondered if anyone has a slicker,
faster, smarter way?



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Diffinitive archiving method sought - Big prize money for best entrant
  2002-09-05  2:17 Diffinitive archiving method sought - Big prize money for best entrant Harry Putnam
@ 2002-09-06 15:50 ` Kai Großjohann
  2002-09-06 21:42   ` Harry Putnam
  0 siblings, 1 reply; 4+ messages in thread
From: Kai Großjohann @ 2002-09-06 15:50 UTC (permalink / raw)
  Cc: ding

Harry Putnam <reader@newsguy.com> writes:

> I use rsync to grab new stuff from the agentized messages, adding
> it to an archive via a daily cron job.  Anyone who knows rsync will
> know that it will keep adding any new messages to an archive.  Ok,
> fine so far.  But given enough time, that archive will grow unusably
> large, or at least large enough to be a pain to search, etc.
>
> I want to break it up into chunks of some kind.  I think calendar
> quarters would be good.  I don't mean here that the messages in the
> archive have to fall inside a certain quarter, but only that the
> quarter not hold any dups from the one before or after.  So I'm not
> concerned about message dates, although they would, by and large, fall
> in place.  Just a user-imposed quarter of collected messages.

Collect 6 months' worth of articles in a directory.  Then archive the
ones older than 3 months into your archive and remove them from the
directory.  Then, after another three months, again archive the old
messages and remove them.

Now comes the problem of how to remove articles.  If you're careful,
it should be possible by removing the articles themselves, plus the
overview entries, plus perhaps adjusting the active file.  Another
possibility is to figure out which function F is called from
gnus-agent-expire to actually delete articles, then get a list of the
messages archived and call that function F on them.
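The file-moving half of that rolling-window scheme could be sketched
like this (a hedged illustration in Python, not Kai's actual
procedure: the spool and archive paths are placeholders, file
modification time stands in for article age, and the overview/active
bookkeeping he mentions is deliberately left out):

```python
import os
import shutil
import time

def archive_older_than(spool, archive, days=90):
    """Move files older than `days` from spool into archive,
    preserving the relative directory layout."""
    cutoff = time.time() - days * 86400
    moved = []
    for root, _dirs, files in os.walk(spool):
        for name in files:
            path = os.path.join(root, name)
            # File modification time as a stand-in for article age.
            if os.path.getmtime(path) < cutoff:
                rel = os.path.relpath(path, spool)
                dest = os.path.join(archive, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(path, dest)
                moved.append(rel)
    return moved
```

Run quarterly from cron against the agent spool, something like this
would do the "archive the ones older than 3 months" step; telling
Gnus the articles are gone remains the hard part.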

Not sure how to do it for leafnode, if you decide to use that.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Diffinitive archiving method sought - Big prize money for best entrant
  2002-09-06 15:50 ` Kai Großjohann
@ 2002-09-06 21:42   ` Harry Putnam
  2002-09-07 19:04     ` Kai Großjohann
  0 siblings, 1 reply; 4+ messages in thread
From: Harry Putnam @ 2002-09-06 21:42 UTC (permalink / raw)



Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Collect 6 months' worth of articles in a directory.  Then archive the
> ones older than 3 months into your archive and remove them from the
> directory.  Then, after another three months, again archive the old
> messages and remove them.
>
> Now comes the problem of how to remove articles.  If you're careful,
> it should be possible by removing the articles themselves, plus the
> overview entries, plus perhaps adjusting the active file.  Another
> possibility is to figure out which function F is called from
> gnus-agent-expire to actually delete articles, then get a list of the
> messages archived and call that function F on them.

I think I understand the procedure, Kai, but it sounds even more
labor-intensive than anything I had come up with.  My whole aim is to
find a lazy way to do it.

Piecing out the overview files and such doesn't fit into my lazy-man
scheme.. hehe.

Also I may not have made clear that the archive itself is not
maintained under gnus.  It just gets fed from there.

I'm thinking a script that works on message dates will be about
right.  The odd message that comes in 3 months late in a thread won't
be important enough to try to allow for.  I considered using file
dates instead, but there are too many ways a file date might get
changed over time, given OS changes or complete revamps, etc.
Even mishaps where a section, or the whole thing, is destroyed and
rebuilt from online archives or the like.

Thanks for the idea..

I guess I've just been too lazy to write some perl to do this.

I've begun to get a semi-outline in mind, so I guess I'll get started
on it.  Maybe I should repost this stuff to gnu.emacs.gnus too.

Something like this should work:

1) Scan each message for a date regex like /^Date:.*(Jan|Feb|Mar)/,
   and take the action below once the blank line ending the headers
   is seen.

2) Action to take (there is perl stuff to do these things, but I have
   to refresh myself on it):
   Using pwd, establish this file's address.
   mkdir -p that address, changing the first directory name in the
   path to the correct quarter.
   Then use perl's rename to put the file at the end of the newly
   created (or existing) address.

This process would have to be carried out on each file.  In the
current case that is 500,000+ files, but it would be much smaller the
next time around.

The current archive would not need to be concerned about the year,
but that shouldn't be too tough to add to the regex either, unless
the dates I have come in lots of weird syntaxes.
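For what it's worth, that per-message pass might be sketched in
Python instead of perl (a rough sketch only; the `YYYY-Qn' directory
naming and the archive layout are my own assumptions, not anything
from the thread):

```python
import os
import re
import shutil

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Map month abbreviation -> calendar quarter, e.g. "Sep" -> "Q3".
QUARTER = {m: "Q%d" % (i // 3 + 1) for i, m in enumerate(MONTHS)}

# Match "Date: Thu, 5 Sep 2002 ..." style headers (RFC 822/2822).
DATE_RE = re.compile(r"^Date:.*?\b\d{1,2}\s+(%s)\s+(\d{4})" % "|".join(MONTHS),
                     re.MULTILINE)

def quarter_of(headers):
    """Return e.g. '2002-Q3' from the Date: header, or None if absent."""
    m = DATE_RE.search(headers)
    if not m:
        return None
    return "%s-%s" % (m.group(2), QUARTER[m.group(1)])

def file_into_quarter(path, archive_root):
    """Move one message file into archive_root/<year>-<quarter>/."""
    with open(path, errors="replace") as f:
        # Headers end at the first blank line; only scan that far.
        headers = f.read().split("\n\n", 1)[0]
    q = quarter_of(headers)
    if q is None:
        return None
    dest_dir = os.path.join(archive_root, q)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(path))
    shutil.move(path, dest)
    return dest
```

Walking the 500,000 files and calling file_into_quarter on each would
do the one-time split, and putting the year in the directory name
keeps late-arriving messages from ever duplicating across quarters.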



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Diffinitive archiving method sought - Big prize money for best entrant
  2002-09-06 21:42   ` Harry Putnam
@ 2002-09-07 19:04     ` Kai Großjohann
  0 siblings, 0 replies; 4+ messages in thread
From: Kai Großjohann @ 2002-09-07 19:04 UTC (permalink / raw)
  Cc: ding

Harry Putnam <reader@newsguy.com> writes:

> I think I understand the procedure, Kai, but it sounds even more
> labor-intensive than anything I had come up with.  My whole aim is to
> find a lazy way to do it.
>
> Piecing out the overview files and such doesn't fit into my lazy-man
> scheme.. hehe.

Well, from my perspective it's all fairly easy, except teaching Gnus
about the articles that have gone missing.

But here is another, wicked, idea which does not involve Gnus at all
and probably allows you to do it with almost no work:

Let there be two machines, `work' and `archive'.  Set up a leafnode
on work and a leafnode on archive.  The leafnode on work fetches
articles from wherever you fetch articles from.  The leafnode on
archive fetches articles from work.

Now tell (the leafnode on) work to expire articles after 3 months and
tell (the leafnode on) archive to never expire articles.

Done!
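For concreteness, the two setups might look like this in leafnode's
config file.  The parameter names here (`server' for the upstream,
`expire' in days) are from memory and should be treated as
assumptions -- check leafnode's documentation -- and the host names
are placeholders:

```
## /etc/leafnode/config on `work' -- fetch from the real news server,
## expire after roughly 3 months (placeholder host name)
server = news.example.com
expire = 90

## /etc/leafnode/config on `archive' -- fetch from `work',
## and set expiry so long it effectively never happens
server = work
expire = 99999
```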

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2002-09-07 19:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-05  2:17 Diffinitive archiving method sought - Big prize money for best entrant Harry Putnam
2002-09-06 15:50 ` Kai Großjohann
2002-09-06 21:42   ` Harry Putnam
2002-09-07 19:04     ` Kai Großjohann
