From mboxrd@z Thu Jan 1 00:00:00 1970 From: mwe012008@gmx.net (Michael Welle) Date: Sun, 03 Jul 2016 07:46:14 +0200 Subject: [TUHS] ML archive In-Reply-To: <1467493624.1873121.655162457.687FFA51@webmail.messagingengine.com> (Random's message of "Sat, 02 Jul 2016 17:07:04 -0400") References: <87inwo1f01.fsf@luisa.c0t0d0s0.de> <1467493624.1873121.655162457.687FFA51@webmail.messagingengine.com> Message-ID: <87oa6fuvdl.fsf@luisa.c0t0d0s0.de> Hello, Random832 writes: > On Sat, Jul 2, 2016, at 07:00, Michael Welle wrote: >> Hello, >> >> I want to complete my local ML archive (I deleted a few emails and I >> wasn't subscribed before 2001 or so I think). After downloading the >> archives and hitting them a few times to get somewhat importable mboxes, >> I ended with 8699 emails in a maildir (in theory that should be a >> superset of the 5027 emails in my regular TUHS maildir. I will merge >> them next.). Two dozens mails are obviously defective (can be repaired >> manually maybe) and some more might be defective (needs deeper >> checking). So, has anybody more ;)? > > Gah, the archive files from before 2002 are a nightmare. yepp, it gets better with later archive files. In the first run I have changed: 1995-October.txt, 1995-November.txt, 1995-December.txt, 1996-March.txt, 1996-September.txt, 1996-November.txt, 1997-August.txt, 1997-September.txt, 1997-October.txt, 1997-November.txt, 1998-April.txt, 1998-February.txt, 1998-March.txt, 1998-May.txt, 1998-August.txt, 1998-November.txt, 1998-December.txt, 1999-January.txt, 1999-February.txt, 1999-March.txt, 1999-May.txt, 1999-June.txt, 1999-August.txt, 1999-September.txt, 1999-November.txt, 1999-October.txt, 1999-December.txt, 2000-January.txt, 2000-February.txt, 2000-April.txt, 2000-May.txt, 2000-July.txt, 2000-June.txt, 2000-August.txt, 2000-October.txt, 2001-January.txt, 2001-February.txt, 2001-March.txt, 2001-April.txt, 2001-May.txt, 2002-October.txt > My best guess on the proper message count is 8675. This is the number of > blocks of non-blank lines which either start with "From " (the easy > case) or, for the hard case, meet the following conditions: > > In a file dated 2001 or earlier > First line contains a colon > Contains at least one line starting with "Received:" I started with an empty line followed by '^(Date|From|To|Message-Id|Received):'. There is at least one match of a 'Date:' in the message's body and there are cases where people appended emails incl. headers to their emails. So a little bit more tweaking is needed. > There might still be a handful of messages this fails to split out. I > would recommend interpreting files dated October 2001 or later as a > strict MBOXO archive, and only doing special processing to files dated > September 2001 or earlier (I haven't factored out my own script to be > able to do anything other than count messages) One other thing I haven't made my mind up about is date headers. Some emails don't have a date header. I feel a bit uneasy about manipulating the emails and add a date header. On the other hand an approximate date header would allow to sort the emails in the clients and give them a bit more context. Any opinions on that? Regards hmw