spamtrap@koldfront.dk (Adam Sjøgren) writes:

> My results are:
>
>  Jesper Harder (spamwash.el+patch): 0.995423
>  Original spam-stat.el (ngnus-0.1): 0.995664
>  Andrew Cohen (patch)             : 0.997591
>
> Each time I installed the version to be tested, trained, and then ran
> spam-stat-test-directory on a spam-group with 4151 emails in it.

This is what I get with Andrew's latest version and the attached
version of spamwash[1]:

                  Spam                   Ham           Time
  -----------------------------------------------------------
  none        4900/5286 = 0.927    2/1641 = 0.0012     215 s
  spam-wash   5136/5286 = 0.972    2/1641 = 0.0012     486 s
  spamwash    5159/5286 = 0.976    3/1641 = 0.0018     394 s

The difference in detection rate between the two washers is probably
not large enough to be statistically significant.  And it's hardly
surprising that they're very close since they do nearly the same
thing.
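
As a rough check on that, here is a minimal sketch of a pooled
two-proportion z-test on the spam column of the table.  The function
name is made up for illustration and is not part of spam-stat.el or
spamwash.el.  The test treats the two runs as independent samples,
which is only an approximation here since both washers saw the same
messages; a paired test would need per-message results that the table
doesn't give.

```elisp
;; Hypothetical helper for illustration only.
(defun my-two-proportion-z (hits1 n1 hits2 n2)
  "Return the z statistic for the difference HITS2/N2 - HITS1/N1.
Uses the pooled two-proportion z-test, which assumes the two
samples are independent."
  (let* ((p1 (/ (float hits1) n1))
         (p2 (/ (float hits2) n2))
         ;; Pooled proportion under the hypothesis of no difference.
         (pooled (/ (float (+ hits1 hits2)) (+ n1 n2)))
         ;; Standard error of the difference under that hypothesis.
         (se (sqrt (* pooled (- 1.0 pooled)
                      (+ (/ 1.0 n1) (/ 1.0 n2))))))
    (/ (- p2 p1) se)))

;; Spam column from the table: spam-wash vs. spamwash.
(my-two-proportion-z 5136 5286 5159 5286)
```

This evaluates to about 1.4, below the 1.96 usually required for
significance at the 5% level, so the numbers are consistent with the
two washers performing about the same.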

The only major difference in output is that spamwash doesn't delete
the MIME headers:

   ------=_NextPart_1169_0527773208410
   Content-Type: text/html;
	  charset=iso-8859-1
   Content-Transfer-Encoding: Quoted-Printable

I don't know if it's the case, but some of that information might be
useful for the Bayesian filter.

Another possible advantage of spamwash is that I think it's easier
for users to customize.  For example, if you wanted to wash HTML with
Lynx before analysis (to defeat poison words inserted as HTML
comments) you could write something like

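    (defun spamwash-treat-html (cte ctl)
      "Decode the body using CTE and CTL, then wash the HTML with Lynx."
      (spamwash-decode-body cte ctl)
      ;; The entry is (lynx FUNCTION ARGS...); call FUNCTION on ARGS.
      (let ((func (cdr (assq 'lynx mm-text-html-washer-alist))))
        (apply (car func) (cdr func))))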

and add ("text/html" . spamwash-treat-html) to
`spamwash-treatment-alist'.
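
Spelled out, the registration might look like the sketch below.  It
assumes the library is installed as spamwash.el somewhere on
`load-path' and that `spamwash-treatment-alist' is a plain alist keyed
by MIME type strings, as the entry above suggests.  Deferring with
`eval-after-load' is my choice here, so that the form is harmless if
spamwash hasn't been loaded yet.

```elisp
;; Assumes spamwash.el is on `load-path' and that
;; `spamwash-treat-html' is defined as above.
(eval-after-load "spamwash"
  '(add-to-list 'spamwash-treatment-alist
                '("text/html" . spamwash-treat-html)))
```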

An advantage of Andrew's code is that it builds on code that is
better tested and debugged.


[1] I stripped the "Xref" header before training (committed).
    Otherwise the prediction rate is too optimistic, since Xref
    names the group a message is filed in and so gives away whether
    it is spam.

    I also did (define-coding-system-alias 'ks_c_5601-1987 'euc-kr),
    which helped both washers quite a bit.  ks_c_5601-1987 seems to be
    an alias or superset of euc-kr (does someone know?).  Maybe we
    should add it to `mm-charset-synonym-alist'.
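
    For anyone who wants to try the synonym locally first, a minimal
    sketch follows.  It assumes the entries of
    `mm-charset-synonym-alist' have the form (SYNONYM . CHARSET), and
    it takes for granted that treating ks_c_5601-1987 as euc-kr is
    acceptable, which is exactly the open question above.

```elisp
(require 'mm-util)

;; Assumption: ks_c_5601-1987 can be decoded as euc-kr.
;; If it is really a superset, some characters may not decode.
(add-to-list 'mm-charset-synonym-alist
             '(ks_c_5601-1987 . euc-kr))
```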