From: Harry Putnam
Newsgroups: gmane.emacs.gnus.general
Subject: OT [Archive techniques] What to do when it gets massive
Date: Wed, 11 Aug 2004 20:34:02 -0500
Organization: Still searching...
User-Agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3.50 (gnu/linux)

I've been archiving a changing list of nntp and mail messages for a very long time, partly to have a hefty amount of data to test various search techniques against. I've never really hit on a good method for doing this.

I started with rsync and still use it like this:

Run rsync against ~/News/agent/nntp, using an exclude file that keeps out anything but the directories and messages, into a mirror of those directories.

The result is that as new messages come in and old ones are expired from ~/News, they accumulate on /arch/news. At some point the pile gets so large that any command run against it takes a long time. I'd like to break this pile up somehow, but will work on that later.

Right now I'd like to start rsyncing to dated mirrors one month at a time. However, I see no way to do this without major overlap.

Example: The agent downloads for a month and I have a large accumulation under News/agent/nntp. These have been getting rsynced to this month's mirror. Now when I change over to a new month and start feeding a new, empty mirror, all the messages under News...nntp are copied there unless I empty out News/agent/nntp. Even then, without hand work of some kind, the agent will download whatever is still on the server in its initial run, and many of those will be overlaps.
Actually the vast majority will. Rsync seems to have no kind of `newer' type thing like find has.

I've wondered whether, if I just removed all the numbered files but left the .agentview files in place, the agent would continue with only new messages it hasn't seen. If so, that would be one way to do it. It would leave only one major inconvenience: I'd have no backlog of messages in any group for a while, in case I wanted to A T a thread or do a search or something.

I'm betting some of the seasoned troopers here have much better ways of doing this.

Answers of `use google instead' or `use search.gmane.org instead' are not accepted... hehe.
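
For what it's worth, the `newer' idea could be roughed out outside rsync. This is only a minimal sketch in Python, not something I've run against a real agent tree. It assumes that a numbered file's mtime is roughly the time the agent fetched it, and that a stamp file can stand in for find's -newer reference. The function name copy_newer, the stamp file, and the YYYY-MM directory layout are all made up for illustration.

```python
import os
import shutil
import time
from pathlib import Path


def copy_newer(src_root, arch_root, stamp, now=None):
    """Copy numbered article files modified after the stamp file's mtime
    into a mirror named for the current month, then refresh the stamp.

    Assumption: file mtime approximates the time the agent fetched the
    article. Returns the list of destination paths that were written.
    """
    src_root, arch_root, stamp = Path(src_root), Path(arch_root), Path(stamp)
    now = time.time() if now is None else now
    # With no stamp yet, everything counts as new (first run).
    cutoff = stamp.stat().st_mtime if stamp.exists() else 0.0
    month_dir = arch_root / time.strftime("%Y-%m", time.localtime(now))

    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            # Keep only numbered article files; skip .agentview and friends.
            if not name.isdigit():
                continue
            src = Path(dirpath) / name
            if src.stat().st_mtime <= cutoff:
                continue  # already seen by an earlier run
            dest = month_dir / src.relative_to(src_root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 preserves the mtime
            copied.append(dest)

    # Record the start time of this run, so files that arrive while the
    # walk is in progress are picked up next time instead of being lost.
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.touch()
    os.utime(stamp, (now, now))
    return copied
```

Run from cron as something like copy_newer("~/News/agent/nntp" expanded to a full path, "/arch/news", some stamp file), each message would land only in the month it first showed up, and changing months would need no emptying of News/agent/nntp. The price is that it never notices files whose mtime gets set backwards.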