Gnus development mailing list
From: Jesper Harder <harder@ifa.au.dk>
Subject: Re: spam-stat.el and mime
Date: Thu, 22 Jan 2004 08:30:19 +0100
Message-ID: <m38yk0e8z8.fsf@defun.localdomain>
In-Reply-To: <87vfn5owzq.fsf@virgil.koldfront.dk> (Adam Sjøgren's message of "Wed, 21 Jan 2004 21:41:29 +0100")

[-- Attachment #1: Type: text/plain, Size: 2299 bytes --]

spamtrap@koldfront.dk (Adam Sjøgren) writes:

> My results are:
>
>  Jesper Harder (spamwash.el+patch): 0.995423
>  Original spam-stat.el (ngnus-0.1): 0.995664
>  Andrew Cohen (patch)             : 0.997591
>
> Each time I installed the version to be tested, trained it, and then ran
> spam-stat-test-directory on a spam-group with 4151 emails in it.

This is what I get with Andrew's latest version and the attached
version of spamwash[1]:

                  Spam                   Ham           Time
  -----------------------------------------------------------
  none        4900/5286 = 0.927    2/1641 = 0.0012     215 s
  spam-wash   5136/5286 = 0.972    2/1641 = 0.0012     486 s
  spamwash    5159/5286 = 0.976    3/1641 = 0.0018     394 s

The difference in detection rate between the two washers is probably
not large enough to be statistically significant.  And it's hardly
surprising that they're very close since they do nearly the same
thing.
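
A rough way to check that guess is a two-proportion z statistic on the
spam counts from the table.  The sketch below is only an approximation:
both washers were run on the same messages, so a paired test would be
more appropriate, and `my-two-proportion-z' is a hypothetical helper,
not part of spam-stat.el or spamwash.el.

```elisp
(defun my-two-proportion-z (hits1 n1 hits2 n2)
  "Return the z statistic for HITS1/N1 versus HITS2/N2.
Hypothetical helper; uses the pooled-proportion standard error."
  (let* ((p1 (/ (float hits1) n1))
         (p2 (/ (float hits2) n2))
         ;; Pooled detection rate under the hypothesis of no difference.
         (pooled (/ (float (+ hits1 hits2)) (+ n1 n2)))
         (se (sqrt (* pooled (- 1.0 pooled)
                      (+ (/ 1.0 n1) (/ 1.0 n2))))))
    (/ (- p2 p1) se)))

;; spam-wash: 5136/5286, spamwash: 5159/5286
(my-two-proportion-z 5136 5286 5159 5286)
```

With these counts the statistic comes out at roughly 1.4, below the
1.96 usually required for significance at the 5% level (two-sided).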

The only major difference in output is that spamwash doesn't delete
the MIME headers:

   ------=_NextPart_1169_0527773208410
   Content-Type: text/html;
           charset=iso-8859-1
   Content-Transfer-Encoding: Quoted-Printable

I don't know if it's the case, but some of that information might be
useful for the Bayesian filter.

Another possible advantage is that I think spamwash is easier for users
to customize.  For example, if you wanted to wash HTML with Lynx before
analysis (to defeat poison words inserted as HTML comments), you could
write something like

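    (defun spamwash-treat-html (cte ctl)
      ;; Decode the body first, then run it through the Lynx washer.
      (spamwash-decode-body cte ctl)
      ;; The alist entry's cdr is a function followed by its arguments.
      (let ((func (cdr (assq 'lynx mm-text-html-washer-alist))))
        (apply (car func) (cdr func))))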

and add ("text/html" . spamwash-treat-html) to
`spamwash-treatment-alist'.
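
The registration step could look like the sketch below.  It assumes
that `spamwash-treatment-alist' holds (MIME-TYPE . FUNCTION) pairs, as
the entry above suggests.  The `defvar' is only a stand-in so the
snippet evaluates without the attachment; drop it when the real
spamwash.el is loaded, since it would otherwise hide that file's
default value.

```elisp
;; Stand-in declaration (assumption): the real variable lives in spamwash.el.
(defvar spamwash-treatment-alist nil
  "Alist of (MIME-TYPE . FUNCTION) used to treat message parts.")

;; Register the Lynx-based handler for HTML parts.
(add-to-list 'spamwash-treatment-alist
             '("text/html" . spamwash-treat-html))
```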

An advantage of Andrew's code is that it builds on code that is better
tested and debugged.


[1] I stripped the "Xref" header before training (committed).
    Otherwise the prediction rate is too optimistic.
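
For illustration, stripping the header could be sketched as follows.
This is not the committed change; `my-strip-xref-header' is a
hypothetical helper that works on a message in the current buffer and
assumes the headers end at the first blank line.

```elisp
(defun my-strip-xref-header ()
  "Delete any Xref header from the message in the current buffer.
Hypothetical helper; not the code that was committed."
  (save-excursion
    (save-restriction
      (goto-char (point-min))
      ;; Narrow to the header block, which ends at the first blank line.
      (narrow-to-region (point-min)
                        (if (re-search-forward "^$" nil t)
                            (point)
                          (point-max)))
      (goto-char (point-min))
      (let ((case-fold-search t))
        (while (re-search-forward "^Xref:" nil t)
          (let ((start (match-beginning 0)))
            ;; Also remove continuation lines, which start with whitespace.
            (forward-line 1)
            (while (and (not (eobp)) (looking-at "[ \t]"))
              (forward-line 1))
            (delete-region start (point))))))))
```

Only the header block is touched, so a line starting with "Xref:" in
the body is left alone.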

    I also did (define-coding-system-alias 'ks_c_5601-1987 'euc-kr),
    which helped both washers quite a bit.  ks_c_5601-1987 seems to be
    an alias or superset of euc-kr (does someone know?).  Maybe we
    should add it to `mm-charset-synonym-alist'.
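
Both steps from this footnote could be sketched like this.  The alias
is guarded so it does nothing where Emacs already knows the name.  The
synonym entry is only the suggestion above turned into code, and it
assumes the alist holds (NAME . REAL-NAME) pairs; later Gnus versions
may already carry their own mapping for this charset.

```elisp
(require 'mm-util)

;; Emacs-level alias, as in the footnote; skipped if already defined.
(unless (coding-system-p 'ks_c_5601-1987)
  (define-coding-system-alias 'ks_c_5601-1987 'euc-kr))

;; Gnus-level mapping from the nonstandard name to the real charset.
(add-to-list 'mm-charset-synonym-alist '(ks_c_5601-1987 . euc-kr))
```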


[-- Attachment #2: spamwash.el --]
[-- Type: application/emacs-lisp, Size: 2843 bytes --]
