Gnus development mailing list
* Major splitting problem ... Advice please
@ 2001-10-11  5:26 Harry Putnam
  2001-10-11  7:40 ` Kai Großjohann
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Harry Putnam @ 2001-10-11  5:26 UTC (permalink / raw)



I've become something of an amateur archivist over the past few years
mainly due to a desire to have a big pile of usenet available for
searching, and for running various programming experiments in such
languages as awk and perl.

That said, I've decided my archive of 250,000+ messages from some
dozen or so newsgroups is just too unwieldy for easy use in its
present form (the way it came off the nntp servers).

The groups are too vast to be really usable inside gnus unless most of
the nifty formatting, threading, etc. is forgone, and then why bother,
really.

I currently use command line tools or homemade scripting to extract
info from this pile, but it would be nice to be easily able to access
it with gnus at times too.  By `access' here, I don't mean nndir or
the like, but handy smallish groups that handle well inside gnus,
where all manner of highlighting or other special treatment/sorting
wouldn't be a major time drag.  Maybe a series of nnml groups for each
main newsgroup or something.

To cut to the chase here, I'm thinking of splitting this up into
groups that contain one month/yr of a specific group.  

However, there are enough different date styles to make that kind of
split pretty hard to program.  There is also the problem of messages
that came late to a thread landing in a different group.  Keeping
all thread members in one group may not even be possible, except by
hand.  I'm not sure.

Splitting on year would be easy enough but would still result in
groups too big for handy use. I'm thinking maybe something based on
file names?  These messages have their original file names as they
came from the server (for the most part).

Or maybe just split it up into groups of 2000 or fewer under each
newsgroup, not paying attention to date or worrying about split
threads.  (Is `A T' capable of pulling messages from other `nnml'
groups?)
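
A minimal sketch of the fixed-size split, assuming a flat directory of
message files whose names contain no whitespace (article numbers, for
instance).  The function name chunk_dir and its three arguments are
hypothetical, and the sketch ignores dates and threads entirely:

```shell
# Move the files in directory $1 into numbered subdirectories of $2,
# with at most $3 files in each.  Assumes $1 is flat and that file
# names contain no whitespace (hypothetical helper, not part of Gnus).
chunk_dir() {
  src=$1
  dest=$2
  size=$3
  ls "$src" | sort -n |
  awk -v size="$size" '{ printf "%d %s\n", int((NR - 1) / size) + 1, $0 }' |
  while read -r chunk name
  do
    mkdir -p "$dest/$chunk"
    mv -- "$src/$name" "$dest/$chunk/"
  done
}
```

Something like `chunk_dir comp.unix.solaris split/comp.unix.solaris 2000`
would then fill split/comp.unix.solaris/1, /2, and so on in article
order.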

I'm wondering if some of the card-carrying archivists here, like maybe
Karl K., could outline a summary of how they would do something like
this.  Or any suggestions from anyone who has seen various setups or
has experience of some kind with a problem like this.

I mean just off the top of respective heads. 




* Re: Major splitting problem ... Advice please
  2001-10-11  5:26 Major splitting problem ... Advice please Harry Putnam
@ 2001-10-11  7:40 ` Kai Großjohann
  2001-10-12  4:01   ` Harry Putnam
  2001-10-11 12:02 ` Karl Kleinpaste
  2001-10-12 16:44 ` Rob Browning
  2 siblings, 1 reply; 9+ messages in thread
From: Kai Großjohann @ 2001-10-11  7:40 UTC (permalink / raw)
  Cc: ding

Harry Putnam <reader@newsguy.com> writes:

> I currently use command line tools or homemade scripting to extract
> info from this pile, but it would be nice to be easily able to access
> it with gnus at times too.  By `access' here, I don't mean nndir or
> the like.  But handy smallish groups that handle well inside
> gnus. Where all manner of highlight or other special treatment/sorting
> wouldn't be a major time drag. Maybe a series of nnml groups for each
> main newsgroup or something.

Maybe the best thing to do is to write a script which extracts
threads.  Then you just keep adding threads to a group until it gets
larger than N messages, then start the next group.

Let T be an empty thread.  You find a message R (the root of the
thread) which does not have a References header.  You add it to the
thread.  So T now contains one message.  Let M be the set of
Message-ID headers for all messages in T.  You find a new message X
and you see that its References header mentions one of the Message-IDs
in M.  So you add X to T, and you add X's Message-ID to M, and repeat.

See?
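
The procedure above can be sketched in shell and awk.  This is only an
illustration under stated assumptions: the function names
index_messages and thread_files are made up here, each message is
assumed to be one file with RFC 822 style headers, and file names are
assumed to contain no tabs.  It collects a single thread from a given
root Message-ID; filling groups up to N messages is left out.

```shell
# Print "file<TAB>message-id<TAB>references" for every file given.
# Only the header block (up to the first blank line) is read, and
# folded References lines are joined.
index_messages() {
  for f in "$@"
  do
    printf '%s\t' "$f"
    awk '
      /^$/     { exit }                         # end of the headers
      /^[ \t]/ { if (cur == "references") refs = refs " " $0; next }
      {
        cur = tolower($0); sub(/:.*/, "", cur)  # current header name
        val = $0; sub(/^[^:]*:[ \t]*/, "", val) # current header value
        if (cur == "message-id") id = val
        else if (cur == "references") refs = val
      }
      END { gsub(/[ \t]+/, " ", refs); printf "%s\t%s\n", id, refs }
    ' < "$f"
  done
}

# Read that index on stdin and print the files belonging to the thread
# whose root has Message-ID $1.  The loop repeats until a full pass
# adds nothing new, so message order in the index does not matter.
thread_files() {
  awk -F '\t' -v root="$1" '
    { file[NR] = $1; id[NR] = $2; refs[NR] = $3 }
    END {
      member[root] = 1                          # the set M
      do {
        grew = 0
        for (i = 1; i <= NR; i++) {
          if (picked[i]) continue
          hit = (id[i] == root)
          n = split(refs[i], r, " ")
          for (j = 1; j <= n && !hit; j++)
            if (r[j] in member) hit = 1
          if (hit) {
            picked[i] = 1
            if (id[i] != "") member[id[i]] = 1
            grew = 1
          }
        }
      } while (grew)
      for (i = 1; i <= NR; i++) if (picked[i]) print file[i]
    }'
}
```

For example, `index_messages message/* | thread_files '<some-id@host>'`
would list the files of that thread (the Message-ID here is a
placeholder).  With very large directories the index is better built
once and saved to a file.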

kai
-- 
Linux provides a nice `poweroff' command, but where is `poweron'?




* Re: Major splitting problem ... Advice please
  2001-10-11  5:26 Major splitting problem ... Advice please Harry Putnam
  2001-10-11  7:40 ` Kai Großjohann
@ 2001-10-11 12:02 ` Karl Kleinpaste
  2001-10-11 15:54   ` Paul Jarc
  2001-10-12 16:44 ` Rob Browning
  2 siblings, 1 reply; 9+ messages in thread
From: Karl Kleinpaste @ 2001-10-11 12:02 UTC (permalink / raw)


Harry Putnam <reader@newsguy.com> writes:
> To cut to the chase here, I'm thinking of splitting this up into
> groups that contain one month/yr of a specific group.  

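# Sort messages into NewArchive/<year>/<month> by their Date: header.
for year in 1995 1996 1997 1998 1999 2000 2001
do
  for month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  do
    newdir=NewArchive/$year/$month
    mkdir -p "$newdir"
    # -l lists matching files, -i ignores case, -s silences read errors
    grep -isl "^Date:.*$month.*$year" message/* |
    while read -r article
    do
      mv "$article" "$newdir"
    done
  done
done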

Embellish to taste, if e.g. the messages do not have unique names
across a set of directories.

It's too bad that xargs(1) can't be used following the grep; the inner
"while" loop could be disposed of entirely if so, but that's not how
mv(1) works.

Er...well, it's icky, but...

function newmv()
{
  destdir=$1
  shift
  mv "$@" "$destdir"
}

Then the "while" is replaced by
     grep -isl ... | xargs newmv $newdir
which perhaps isn't all that icky after all.

> However, there are enough different date styles to make that kind of
> split pretty hard to program.

If there are enough odd (broken) date formats so as not to be caught
by this, then after this is run, go back and work out new variants for
the "for" loops.  Repeat "for" with ever newer and weirder date
discriminants until there's nothing left to move.

> Also the problem of some messages that
> came late to a thread, landing in a different group arises.  Keeping
> all thread members in one group may not even be possible, except by
> hand.  I'm not sure. 

As soon as you decide to use date-based storage, you break either that
storage mechanism or you break border-crossing threads.  Pick one or
the other.

OTOH -- and I know we've been over this ground before -- I've become
so attached to nnir & swish++ that I would leave the groups in
whatever huge collections you've got and simply never enter them
directly, but rather do nnir queries to pick up what I need.  swish++
is _fast_.  Periodically run nnml-generate-nov-databases to keep the
overviews current, if you continue to add messages to these archives.

--karl




* Re: Major splitting problem ... Advice please
  2001-10-11 12:02 ` Karl Kleinpaste
@ 2001-10-11 15:54   ` Paul Jarc
  2001-10-11 16:25     ` Paul Jarc
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Jarc @ 2001-10-11 15:54 UTC (permalink / raw)


Karl Kleinpaste <karl@charcoal.com> wrote:
> It's too bad that xargs(1) can't be used following the grep; the inner
> "while" loop could be disposed of entirely if so, but that's not how
> mv(1) works.

xargs -n1 should work.

> function newmv()
> {
>   destdir=$1
>   shift
>   mv "$@" $destdir
> }
>
> Then the "while" is replaced by
>      grep -isl ... | xargs newmv $newdir
> which perhaps isn't all that icky after all.

Assuming you make newmv a shell script and not just a shell function
in the current shell.
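
One way to avoid a separate script file is to hand xargs an inline
`sh -c' command instead, since xargs can only run external programs.
A sketch, with the hypothetical name move_listed_files, assuming file
names without whitespace or quote characters (xargs splits on those):

```shell
# Move every file named on stdin into directory $1, batching as many
# names per mv as xargs allows.  Inside sh -c the destination arrives
# as $1 and the file names follow it.
move_listed_files() {
  xargs sh -c '
    dest=$1
    shift
    # Some xargs versions run the command once even on empty input.
    if [ "$#" -gt 0 ]; then mv -- "$@" "$dest"; fi
  ' sh "$1"
}
```

The earlier pipeline would then end in
`| move_listed_files "$newdir"`.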

>> Also the problem of some messages that
>> came late to a thread, landing in a different group arises.  Keeping
>> all thread members in one group may not even be possible, except by
>> hand.  I'm not sure. 
>
> As soon as you decide to use date-based storage, you break either that
> storage mechanism or you break border-crossing threads.  Pick one or
> the other.

Yes... but if you could choose the date boundaries on the fly, you
could get whole threads for all the threads you're interested in.
Is there anything that can do this?


paul




* Re: Major splitting problem ... Advice please
  2001-10-11 15:54   ` Paul Jarc
@ 2001-10-11 16:25     ` Paul Jarc
  2001-10-11 16:37       ` Kai Großjohann
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Jarc @ 2001-10-11 16:25 UTC (permalink / raw)


I wrote:
> Karl Kleinpaste <karl@charcoal.com> wrote:
>> It's too bad that xargs(1) can't be used following the grep; the inner
>> "while" loop could be disposed of entirely if so, but that's not how
>> mv(1) works.
>
> xargs -n1 should work.

Er.  Except the order of mv's arguments would be wrong.  Then, since
we need a little shell script anyway, -n1 wouldn't be very useful.


paul




* Re: Major splitting problem ... Advice please
  2001-10-11 16:25     ` Paul Jarc
@ 2001-10-11 16:37       ` Kai Großjohann
  0 siblings, 0 replies; 9+ messages in thread
From: Kai Großjohann @ 2001-10-11 16:37 UTC (permalink / raw)


prj@po.cwru.edu (Paul Jarc) writes:

> Er.  Except the order of mv's arguments would be wrong.  Then, since
> we need a little shell script anyway, -n1 wouldn't be very useful.

for i in foo bar baz; do echo $i; done | xargs -n1 -i@ echo xx @
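
Applied to the mv case, the same placeholder trick might look like the
sketch below.  It uses `-I', the POSIX spelling of the option (GNU
xargs documents `-i' as deprecated in favour of it); the function name
move_one_by_one is made up for illustration, and one name per input
line is assumed.

```shell
# Move each file named on stdin (one name per line) into directory $1.
# -I substitutes the name for {} and runs one mv per input line, so
# the destination can stay last on the mv command line.
move_one_by_one() {
  xargs -I '{}' mv -- '{}' "$1"
}
```

This runs one mv per message, which is simple but slower than batching
when there are many thousands of files.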

kai
-- 
Linux provides a nice `poweroff' command, but where is `poweron'?




* Re: Major splitting problem ... Advice please
  2001-10-11  7:40 ` Kai Großjohann
@ 2001-10-12  4:01   ` Harry Putnam
  0 siblings, 0 replies; 9+ messages in thread
From: Harry Putnam @ 2001-10-12  4:01 UTC (permalink / raw)


Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
[...]

Karl Kleinpaste <karl@charcoal.com> writes:
[...]

OK ... thanks for the good clues... Still debating with myself...




* Re: Major splitting problem ... Advice please
  2001-10-11  5:26 Major splitting problem ... Advice please Harry Putnam
  2001-10-11  7:40 ` Kai Großjohann
  2001-10-11 12:02 ` Karl Kleinpaste
@ 2001-10-12 16:44 ` Rob Browning
  2001-10-13  4:28   ` Harry Putnam
  2 siblings, 1 reply; 9+ messages in thread
From: Rob Browning @ 2001-10-12 16:44 UTC (permalink / raw)
  Cc: ding

Harry Putnam <reader@newsguy.com> writes:

> To cut to the chase here, I'm thinking of splitting this up into
> groups that contain one month/yr of a specific group.  

Hmm, I've been working on something similar off and on.  I was only
working on per-year splitting, and I hadn't yet decided what, if
anything, I was going to try to do about breaking threads, but I was
trying to write an elisp function using gnus calls to do the job
because I wanted to preserve all my marks.

> However, there are enough different date styles to make that kind of
> split pretty hard to program.  Also the problem of some messages
> that came late to a thread, landing in a different group arises.
> Keeping all thread members in one group may not even be possible,
> except by hand.  I'm not sure.

Hmm.  I had just been planning to use gnus date functions.  I hadn't
considered that those might not be sufficient.

-- 
Rob Browning
rlb @defaultvalue.org, @linuxdevel.com, and @debian.org
Previously @cs.utexas.edu
GPG=1C58 8B2C FB5E 3F64 EA5C  64AE 78FE E5FE F0CB A0AD




* Re: Major splitting problem ... Advice please
  2001-10-12 16:44 ` Rob Browning
@ 2001-10-13  4:28   ` Harry Putnam
  0 siblings, 0 replies; 9+ messages in thread
From: Harry Putnam @ 2001-10-13  4:28 UTC (permalink / raw)


Rob Browning <rlb@defaultvalue.org> writes:

>> However, there are enough different date styles to make that kind of
>> split pretty hard to program.  Also the problem of some messages
>> that came late to a thread, landing in a different group arises.
>> Keeping all thread members in one group may not even be possible,
>> except by hand.  I'm not sure.
>
> Hmm.  I had just been planning to use gnus date functions.  I hadn't
> considered that those might not be sufficient.

My comments may have been a little misleading.  They were directed at
the idea of splitting messages by date with tools such as awk and
perl.  What I was getting at was a certain amount of difficulty
getting regular expressions that match all possible date formulations
like these (Taken from a sample of headers on comp.unix.solaris):

 Date: 24 Sep 2001 09:07:45 GMT
 Date: Mon, 8 Oct 2001 15:30:18 +0100
 Date: 8 Oct 2001 14:30:26 GMT
 Date: 08 Oct 2001 16:42:08 +0200
 Date: Sun, 7 Oct 2001 20:02:06 +0200
 Date: Sun, 07 Oct 2001 17:45:17 GMT


There are some even odder formulations to be found.  It's probably not
impossible to write a regexp that will work for them all, but it's just
a pita.
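
For the six sample formats listed above, one approach is to strip the
optional weekday first and then read day, month and year as plain
fields rather than matching everything with one regexp.  A sketch in
awk, with the hypothetical name date_bucket; it only covers four-digit
years with the day before the month, and reports anything else as
"unknown" so the leftovers can be inspected by hand:

```shell
# Read Date: header lines on stdin and print "YYYY/Mon" for each one
# recognised, or "unknown" otherwise.
date_bucket() {
  awk '
    {
      line = $0
      sub(/^[ \t]*[Dd][Aa][Tt][Ee]:[ \t]*/, "", line)  # drop header name
      sub(/^[A-Za-z]+,[ \t]*/, "", line)               # drop optional weekday
      n = split(line, part, /[ \t]+/)
      if (n >= 3 &&
          part[1] ~ /^[0-9][0-9]?$/ &&
          part[2] ~ /^[A-Za-z][A-Za-z][A-Za-z]$/ &&
          part[3] ~ /^[0-9][0-9][0-9][0-9]$/)
        print part[3] "/" part[2]
      else
        print "unknown"
    }'
}
```

Feeding it `Date: Mon, 8 Oct 2001 15:30:18 +0100` prints `2001/Oct`,
which can serve directly as a directory name under the archive.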

If you plan to use the date functions that do limiting, like
`/ t' and `C-u / t', it may not be a problem.  I wanted to do the
splitting outside gnus because it is such a large archive
(approx. 250,000 messages from about a dozen groups).

I haven't tried this, but I suspect one could do it by first setting
up an nnmail split method that splits by date into mnemonically named
groups, then entering the monster groups and splitting them with `M P a
<RET> B r <RET>', adjusting the split method rules for each group.  But
here again, I would expect some extensive experimentation to get the
date regexp right.  And it would be very time-intensive to do that
inside gnus, I think, assuming the groups are above 25,000 or so.



