The Unix Heritage Society mailing list
* [TUHS] ML archive
From: Michael Welle @ 2016-07-02 11:00 UTC


Hello,

I want to complete my local ML archive (I deleted a few emails and I
wasn't subscribed before 2001 or so, I think). After downloading the
archives and hitting them a few times to get somewhat importable mboxes,
I ended up with 8699 emails in a maildir (in theory that should be a
superset of the 5027 emails in my regular TUHS maildir; I will merge
them next). Two dozen mails are obviously defective (they can maybe be
repaired manually) and some more might be defective (that needs deeper
checking). So, does anybody have more ;)?

Regards
hmw



* [TUHS] ML archive
From: Random832 @ 2016-07-02 21:07 UTC


On Sat, Jul 2, 2016, at 07:00, Michael Welle wrote:
> Hello,
> 
> I want to complete my local ML archive (I deleted a few emails and I
> wasn't subscribed before 2001 or so, I think). After downloading the
> archives and hitting them a few times to get somewhat importable mboxes,
> I ended up with 8699 emails in a maildir (in theory that should be a
> superset of the 5027 emails in my regular TUHS maildir; I will merge
> them next). Two dozen mails are obviously defective (they can maybe be
> repaired manually) and some more might be defective (that needs deeper
> checking). So, does anybody have more ;)?

Gah, the archive files from before 2002 are a nightmare.

My best guess on the proper message count is 8675. This is the number of
blocks of non-blank lines which either start with "From " (the easy
case) or, for the hard case, meet the following conditions:

In a file dated 2001 or earlier
First line contains a colon
Contains at least one line starting with "Received:"

There might still be a handful of messages this fails to split out. I
would recommend interpreting files dated October 2001 or later as a
strict MBOXO archive, and only doing special processing to files dated
September 2001 or earlier (I haven't factored out my own script to be
able to do anything other than count messages).
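
The counting rule described above could be sketched roughly as follows.
This is not the script mentioned in this message, only a guess at its
shape. The function names and the `year` parameter (assumed to be taken
from the archive file name, e.g. 1999 for 1999-March.txt) are made up for
illustration.

```python
from itertools import groupby


def blocks(text):
    """Yield each run of consecutive non-blank lines as a list of lines."""
    for is_blank, group in groupby(text.splitlines(), key=lambda l: not l.strip()):
        if not is_blank:
            yield list(group)


def count_messages(text, year):
    """Count blocks that look like the start of a message.

    `year` is the (assumed) year of the archive file the text came from.
    """
    count = 0
    for lines in blocks(text):
        if lines[0].startswith("From "):
            # Easy case: a regular mbox separator line.
            count += 1
        elif (
            year <= 2001
            and ":" in lines[0]
            and any(line.startswith("Received:") for line in lines)
        ):
            # Hard case: a header block without a "From " separator.
            count += 1
    return count


sample = (
    "From alice Mon Jan  1 00:00:00 1999\n"
    "Subject: one\n"
    "\n"
    "Body of one: with a colon.\n"
    "\n"
    "Received: from host\n"
    "Subject: two\n"
    "\n"
    "Body two.\n"
)
print(count_messages(sample, 1999))
```

Body paragraphs are blocks too, so a quoted header block inside a body
that happens to include a "Received:" line would still be miscounted.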



* [TUHS] ML archive
From: Michael Welle @ 2016-07-03  5:46 UTC


Hello,

Random832 <random832 at fastmail.com> writes:

> On Sat, Jul 2, 2016, at 07:00, Michael Welle wrote:
>> Hello,
>> 
>> I want to complete my local ML archive (I deleted a few emails and I
>> wasn't subscribed before 2001 or so, I think). After downloading the
>> archives and hitting them a few times to get somewhat importable mboxes,
>> I ended up with 8699 emails in a maildir (in theory that should be a
>> superset of the 5027 emails in my regular TUHS maildir; I will merge
>> them next). Two dozen mails are obviously defective (they can maybe be
>> repaired manually) and some more might be defective (that needs deeper
>> checking). So, does anybody have more ;)?
>
> Gah, the archive files from before 2002 are a nightmare.
Yep, it gets better with later archive files. In the first run I
changed:

1995-October.txt, 1995-November.txt, 1995-December.txt,
1996-March.txt, 1996-September.txt, 1996-November.txt,
1997-August.txt, 1997-September.txt, 1997-October.txt,
1997-November.txt, 1998-February.txt, 1998-March.txt,
1998-April.txt, 1998-May.txt, 1998-August.txt,
1998-November.txt, 1998-December.txt, 1999-January.txt,
1999-February.txt, 1999-March.txt, 1999-May.txt,
1999-June.txt, 1999-August.txt, 1999-September.txt,
1999-October.txt, 1999-November.txt, 1999-December.txt,
2000-January.txt, 2000-February.txt, 2000-April.txt,
2000-May.txt, 2000-June.txt, 2000-July.txt,
2000-August.txt, 2000-October.txt, 2001-January.txt,
2001-February.txt, 2001-March.txt, 2001-April.txt,
2001-May.txt, 2002-October.txt


> My best guess on the proper message count is 8675. This is the number of
> blocks of non-blank lines which either start with "From " (the easy
> case) or, for the hard case, meet the following conditions:
>
> In a file dated 2001 or earlier
> First line contains a colon
> Contains at least one line starting with "Received:"
I started with an empty line followed by '^(Date|From|To|Message-Id|Received):'.
There is at least one match of a 'Date:' in a message's body, and there
are cases where people appended whole emails, including headers, to their
own emails. So a little bit more tweaking is needed.
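
A minimal sketch of that first attempt, assuming the whole archive file
has been read into one string. The names BOUNDARY and split_messages are
invented here, and the second sample shows the kind of false split
mentioned above, where a body paragraph starts with 'Date:'.

```python
import re

# An empty line followed by a line that starts with one of the header
# names from the pattern quoted above. The lookahead keeps the header
# line itself in the following part.
BOUNDARY = re.compile(r"\n\n(?=(?:Date|From|To|Message-Id|Received):)")


def split_messages(text):
    """Split raw archive text into candidate messages."""
    return [part for part in BOUNDARY.split(text) if part.strip()]


two_messages = (
    "From: a@example.org\nSubject: x\n\nhello\n\n"
    "Received: from h\nFrom: b@example.org\n\nbye\n"
)
one_message = (
    "From: a@example.org\n\nSee below.\n\n"
    "Date: next Tuesday works for me\n"
)

print(len(split_messages(two_messages)))  # two real messages
print(len(split_messages(one_message)))   # one message, split in two by mistake
```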


> There might still be a handful of messages this fails to split out. I
> would recommend interpreting files dated October 2001 or later as a
> strict MBOXO archive, and only doing special processing to files dated
> September 2001 or earlier (I haven't factored out my own script to be
> able to do anything other than count messages).
One other thing I haven't made up my mind about is date headers. Some
emails don't have a date header. I feel a bit uneasy about manipulating
the emails by adding one. On the other hand, an approximate date header
would allow sorting the emails in mail clients and give them a bit
more context. Any opinions on that?
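
One possible compromise, sketched below, is to add the approximate date
and mark the message as modified so the change stays visible. Everything
here is an assumption: the date is taken as the first day of the month in
the archive file name, the X-Date-Approximated header is invented for
this sketch, and parsing the month name with %B relies on an English or
C locale.

```python
import email
from datetime import datetime, timezone
from email.utils import format_datetime


def add_approximate_date(raw, archive_name):
    """Return the parsed message, adding a Date header only if it is missing.

    `archive_name` is expected to look like "1995-October.txt".
    """
    msg = email.message_from_string(raw)
    if msg["Date"] is None:
        # "1995-October.txt" -> 1995-10-01 00:00 UTC
        approx = datetime.strptime(archive_name, "%Y-%B.txt")
        approx = approx.replace(tzinfo=timezone.utc)
        msg["Date"] = format_datetime(approx)
        # Hypothetical marker header, so the synthetic date is not
        # mistaken for an original one later.
        msg["X-Date-Approximated"] = "derived from " + archive_name
    return msg


undated = "From: a@example.org\nSubject: x\n\nbody\n"
fixed = add_approximate_date(undated, "1995-October.txt")
print(fixed["Date"])
```

Messages that already carry a Date header pass through unchanged.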

Regards
hmw

