From: Jesper Harder
Newsgroups: gmane.emacs.gnus.general
Subject: Re: spam-stat.el and mime
Date: Thu, 22 Jan 2004 08:30:19 +0100
To: ding@gnus.org
References: <87u133g3f4.fsf@andy.bu.edu> <87vfn5owzq.fsf@virgil.koldfront.dk>
In-Reply-To: <87vfn5owzq.fsf@virgil.koldfront.dk> (Adam Sjøgren's message of "Wed, 21 Jan 2004 21:41:29 +0100")
User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3.50 (gnu/linux)

spamtrap@koldfront.dk (Adam Sjøgren) writes:

> My results are:
>
>   Jesper Harder (spamwash.el+patch): 0.995423
>   Original spam-stat.el (ngnus-0.1): 0.995664
>   Andrew Cohen (patch)             : 0.997591
>
> Each time I installing the version to be tested, training and then ran
> spam-stat-test-directory on a spam-group with 4151 emails in it.

This is what I get with Andrew's latest version and the attached
version of spamwash[1]:

                    Spam                   Ham           Time
  -----------------------------------------------------------
  none        4900/5286 = 0.927    2/1641 = 0.0012      215 s
  spam-wash   5136/5286 = 0.972    2/1641 = 0.0012      486 s
  spamwash    5159/5286 = 0.976    3/1641 = 0.0018      394 s

The difference in detection rate between the two washers is probably
not large enough to be statistically significant.  And it's hardly
surprising that they're very close, since they do nearly the same
thing.

The only major difference in output is that spamwash doesn't delete
the MIME headers:

  ------=_NextPart_1169_0527773208410
  Content-Type: text/html; charset=iso-8859-1
  Content-Transfer-Encoding: Quoted-Printable

I don't know if it's the case, but some of that information might be
useful for the Bayesian filter.

Another possible advantage is that I think it's easier for users to
customize.  For example, if you wanted to wash HTML with Lynx before
analysis (to defeat poison words inserted as HTML comments) you could
write something like

  (defun spamwash-treat-html (cte ctl)
    (spamwash-decode-body cte ctl)
    (let ((func (cdr (assq 'lynx mm-text-html-washer-alist))))
      (apply (car func) (cdr func))))

and add

  ("text/html" . spamwash-treat-html)

to `spamwash-treatment-alist', ahead of the generic "text/.*" entry,
since the first matching entry is the one that gets used.

An advantage of Andrew's code is that it's based on better tested and
debugged code.

[1] I stripped the "Xref" header before training (committed).
    Otherwise the prediction rate is too optimistic.

    I also did (define-coding-system-alias 'ks_c_5601-1987 'euc-kr),
    which helped both washers quite a bit.  ks_c_5601-1987 seems to be
    an alias or superset of euc-kr (does anyone know?).  Maybe we
    should add it to `mm-charset-synonym-alist'.

Content-Type: application/emacs-lisp
Content-Disposition: attachment; filename=spamwash.el

;;; spamwash.el

;;; Code:

(require 'mail-parse)
(require 'mail-utils)
(require 'mm-bodies)
(require 'mm-decode)

(defgroup spamwash nil
  "Washing of messages prior to further spam analysis."
  :version "22"
  :group 'spam)

(defcustom spamwash-treatment-alist
  '(("multipart/.*" . spamwash-treat-multipart)
    ("text/.*" . spamwash-decode-body)
    ("image/.*" . spamwash-strip-part)
    ("application/octet-stream" . spamwash-strip-part))
  "An alist of (TYPE . FUNCTION) pairs for `spamwash'.
TYPE is a regexp matching a MIME type and FUNCTION is a function
applied to matching MIME parts."
  :type '(alist :key-type regexp :value-type function)
  :group 'spamwash)

(defun spamwash-decode-body (cte ctl)
  "Decode body according to CTE and charset."
  ;; Fall back to ASCII when the Content-Type names no charset.
  (let ((charset (or (mail-content-type-get ctl 'charset) "ascii")))
    (set-buffer-multibyte t)
    (mm-decode-body charset cte (car ctl))))

(defun spamwash-strip-part (cte ctl)
  "Remove part."
  (delete-region (point-min) (point-max)))

(defun spamwash-treat-multipart (cte ctl)
  "Treat a multipart MIME part."
  (let* ((boundary (concat "^--" (regexp-quote
                                  (mail-content-type-get ctl 'boundary))))
         (eboundary (concat boundary "--[ \t]*$"))
         (boundary (concat boundary "[ \t]*$"))
         start)
    (save-excursion
      ;; Narrow to everything up to the closing boundary, but leave
      ;; point where it was.
      (save-excursion
        (narrow-to-region (point) (or (re-search-forward eboundary nil t)
                                      (point-max))))
      ;; Wash each part between two boundary lines recursively.
      (while (re-search-forward boundary nil t)
        (forward-line 1)
        (setq start (point))
        (when (or (re-search-forward boundary nil 'move)
                  (eobp))
          (end-of-line 0)
          (save-restriction
            (narrow-to-region start (point))
            (spamwash)))))))

(defun spamwash ()
  "Wash current buffer."
  (interactive)
  (let (cte ct ctl match buffer-read-only)
    (save-excursion
      (save-restriction
        (mail-narrow-to-head)
        (mail-decode-encoded-word-region (point-min) (point-max))
        (setq cte (intern (downcase
                           (or (mail-fetch-field "content-transfer-encoding")
                               "7bit"))))
        (setq ct (mail-fetch-field "content-type"))
        (let ((case-fold-search t))
          (setq ctl
                (if ct
                    ;; Optimize for the most common case.
                    (if (string-match
                         "^text/plain;[ \t\n\r]*charset=\"?\\([^;\"]+\\)" ct)
                        (list "text/plain" (cons 'charset (match-string 1 ct)))
                      (condition-case nil
                          (rfc2231-parse-string ct)
                        (error '("text/plain"))))
                  '("text/plain")))
          (setq match (mm-assoc-string-match spamwash-treatment-alist
                                             (or (car ctl) "text/plain")))))
      (when match
        (save-restriction
          ;; Hand only the body (everything after the first blank line)
          ;; to the treatment function.
          (narrow-to-region (progn
                              (goto-char (point-min))
                              (or (search-forward "\n\n" nil t)
                                  (point-max)))
                            (point-max))
          (funcall (cdr match) cte ctl))))))

(provide 'spamwash)

;;; spamwash.el ends here
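
A rough check of the significance remark in the message can be sketched as follows.  It assumes that the two washers' spam detection counts (5136 and 5159 out of 5286) can be treated as independent samples.  They are really paired runs over the same corpus, so a paired test such as McNemar's would be more appropriate, but the per-message outcomes are not given in the post.  The function name `my-two-proportion-z' is made up for this sketch; it computes a pooled two-proportion z statistic.

```elisp
;; Hypothetical helper, not part of spamwash.el or spam-stat.el.
(defun my-two-proportion-z (hits-a hits-b n)
  "Return the pooled z statistic for HITS-A/N versus HITS-B/N.
Assumes both samples have the same size N and are independent."
  (let* ((pa (/ (float hits-a) n))
         (pb (/ (float hits-b) n))
         ;; Pooled proportion over both samples.
         (pooled (/ (float (+ hits-a hits-b)) (* 2 n)))
         ;; Standard error of the difference under the pooled estimate.
         (se (sqrt (* pooled (- 1 pooled) (/ 2.0 n)))))
    (/ (- pb pa) se)))

;; spam-wash versus spamwash, spam column of the table.
(my-two-proportion-z 5136 5159 5286)
```

For the figures in the table this evaluates to about 1.4, which is below the 1.96 usually taken as the two-sided 5% threshold.  Under the stated assumptions that is consistent with the difference between the two washers not being significant.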