spamtrap@koldfront.dk (Adam Sjøgren) writes:

> My results are:
>
>   Jesper Harder (spamwash.el+patch):     0.995423
>   Original spam-stat.el (ngnus-0.1):     0.995664
>   Andrew Cohen (patch)             :     0.997591
>
> Each time I installing the version to be tested, training and then ran
> spam-stat-test-directory on a spam-group with 4151 emails in it.

This is what I get with Andrew's latest version and the attached
version of spamwash[1]:

            Spam                 Ham               Time
-----------------------------------------------------------
none        4900/5286 = 0.927    2/1641 = 0.0012   215 s
spam-wash   5136/5286 = 0.972    2/1641 = 0.0012   486 s
spamwash    5159/5286 = 0.976    3/1641 = 0.0018   394 s

The difference in detection rate between the two washers is probably
not large enough to be statistically significant.  And it's hardly
surprising that they're very close, since they do nearly the same
thing.

The only major difference in output is that spamwash doesn't delete
the MIME headers:

    ------=_NextPart_1169_0527773208410
    Content-Type: text/html; charset=iso-8859-1
    Content-Transfer-Encoding: Quoted-Printable

I don't know if it's the case, but some of that information might be
useful for the Bayesian filter.

Another possible advantage is that I think it's easier for users to
customize.  For example, if you wanted to wash HTML with Lynx before
analysis (to defeat poison words inserted as HTML comments), you could
write something like

    (defun spamwash-treat-html (cte ctl)
      ;; Decode the body first, then run the Lynx washer on it.
      (spamwash-decode-body cte ctl)
      ;; The alist value is (FUNCTION . ARGS).
      (let ((func (cdr (assq 'lynx mm-text-html-washer-alist))))
        (apply (car func) (cdr func))))

and add

    ("text/html" . spamwash-treat-html)

to `spamwash-treatment-alist'.

An advantage of Andrew's code is that it's based on better tested and
debugged code.

[1] I stripped the "Xref" header before training (committed).
    Otherwise the prediction rate is too optimistic.

    I also did (define-coding-system-alias 'ks_c_5601-1987 'euc-kr),
    which helped both washers quite a bit.  ks_c_5601-1987 seems to be
    an alias or superset of euc-kr (does someone know?).  Maybe we
    should add it to `mm-charset-synonym-alist'.
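
For anyone who wants to try the synonym locally before it goes into
Gnus, something like the following in an init file might do.  This is
only a sketch: it assumes that `mm-charset-synonym-alist' (from
mm-util.el) holds (SYNONYM . CHARSET) pairs of symbols, and that
euc-kr is an acceptable target, which is exactly the open question
above.

```elisp
;; Sketch only.  Assumes entries have the form (SYNONYM . CHARSET),
;; both symbols, and that treating ks_c_5601-1987 as euc-kr is good
;; enough for decoding.
(require 'mm-util)
(add-to-list 'mm-charset-synonym-alist '(ks_c_5601-1987 . euc-kr))
```

Unlike `define-coding-system-alias', this should only affect the MIME
code that consults the alist, not the rest of Emacs.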