The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
From: mwe012008@gmx.net (Michael Welle)
Subject: [TUHS] ML archive
Date: Sun, 03 Jul 2016 07:46:14 +0200	[thread overview]
Message-ID: <87oa6fuvdl.fsf@luisa.c0t0d0s0.de> (raw)
In-Reply-To: <1467493624.1873121.655162457.687FFA51@webmail.messagingengine.com> (Random's message of "Sat, 02 Jul 2016 17:07:04 -0400")

Hello,

Random832 <random832 at fastmail.com> writes:

> On Sat, Jul 2, 2016, at 07:00, Michael Welle wrote:
>> Hello,
>> 
>> I want to complete my local ML archive (I deleted a few emails and I
>> wasn't subscribed before 2001 or so I think). After downloading the
>> archives and hitting them a few times to get somewhat importable mboxes,
>> I ended with 8699 emails in a maildir (in theory that should be a
>> superset of the 5027 emails in my regular TUHS maildir. I will merge
>> them next.). Two dozens mails are obviously defective (can be repaired
>> manually maybe) and some more might be defective (needs deeper
>> checking). So, has anybody more ;)?
>
> Gah, the archive files from before 2002 are a nightmare.
yepp, it gets better with later archive files. In the first run I have
changed: 

1995-October.txt, 1995-November.txt, 1995-December.txt,
1996-March.txt, 1996-September.txt, 1996-November.txt,
1997-August.txt, 1997-September.txt, 1997-October.txt,
1997-November.txt, 1998-April.txt, 1998-February.txt,
1998-March.txt, 1998-May.txt, 1998-August.txt,
1998-November.txt, 1998-December.txt, 1999-January.txt,
1999-February.txt, 1999-March.txt, 1999-May.txt,
1999-June.txt, 1999-August.txt, 1999-September.txt, 
1999-November.txt, 1999-October.txt, 1999-December.txt,
2000-January.txt, 2000-February.txt, 2000-April.txt,
2000-May.txt, 2000-July.txt, 2000-June.txt,
2000-August.txt, 2000-October.txt, 2001-January.txt,
2001-February.txt, 2001-March.txt, 2001-April.txt,
2001-May.txt, 2002-October.txt


> My best guess on the proper message count is 8675. This is the number of
> blocks of non-blank lines which either start with "From " (the easy
> case) or, for the hard case, meet the following conditions:
>
> In a file dated 2001 or earlier
> First line contains a colon
> Contains at least one line starting with "Received:"
I started with an empty line followed by '^(Date|From|To|Message-Id|Received):'.
There is at least one match of a 'Date:' in the message's body and there
are cases where people appended emails incl. headers to their emails. So
a little bit more tweaking is needed. 


> There might still be a handful of messages this fails to split out. I
> would recommend interpreting files dated October 2001 or later as a
> strict MBOXO archive, and only doing special processing to files dated
> September 2001 or earlier (I haven't factored out my own script to be
> able to do anything other than count messages)
One other thing I haven't made my mind up about is date headers. Some
emails don't have a date header. I feel a bit uneasy about manipulating
the emails and add a date header. On the other hand an approximate date 
header would allow to sort the emails in the clients and give them a bit
more context. Any opinions on that?

Regards
hmw


      reply	other threads:[~2016-07-03  5:46 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-02 11:00 Michael Welle
2016-07-02 21:07 ` Random832
2016-07-03  5:46   ` Michael Welle [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87oa6fuvdl.fsf@luisa.c0t0d0s0.de \
    --to=mwe012008@gmx.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).