From: Jesper Harder
Newsgroups: gmane.emacs.gnus.general
Subject: Re: spam-stat.el and mime
Date: Tue, 20 Jan 2004 06:56:17 +0100
References: <87u133g3f4.fsf@andy.bu.edu>
In-Reply-To: <87u133g3f4.fsf@andy.bu.edu> (Andrew Cohen's message of "Sat, 10 Jan 2004 11:43:27 -0500")
Original-To: ding@gnus.org
Mail-Followup-To: ding@gnus.org
User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3.50 (gnu/linux)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=

Andrew Cohen writes:

> Checking a bit this was almost entirely because it did no decoding
> of mime (or base64) encoded articles. I've modified it to decode
> mime (if you don't like this it can be controlled by customizing the
> spam-treat-mime-function to nil).

I looked a bit more at it:

+ (defun spam-treat-article ()
+   "Treat the current buffer prior to spam analysis."
+   (interactive)
+   (spam-decode)
    ^^^^^^^^^^^^^
It doesn't work to call `spam-decode' here -- you have to decode each
MIME part separately.

It's not so easy to use the existing MIME parsing functions in Gnus
for this purpose. They were written with display in mind, and they're
also very slow because they do a lot of fancy stuff, which is
unnecessary in this context.

Please try the attached code (it increased my spam recognition rate by
4 percentage points).

--=-=-=
Content-Type: application/emacs-lisp
Content-Disposition: attachment; filename=spamwash.el
Content-Transfer-Encoding: quoted-printable

;;; spamwash.el

;;; Code:

(require 'mail-parse)
(require 'mail-utils)
(require 'mm-bodies)
(require 'mm-decode)

(defgroup spamwash nil
  "Washing of messages prior to further spam analysis."
  :version "22"
  :group 'spam)

(defcustom spamwash-treatment-alist
  '(("multipart/.*" . spamwash-treat-multipart)
    ("text/.*" . spamwash-decode-body)
    ("image/.*" . spamwash-strip-part)
    ("application/octet-stream" . spamwash-strip-part))
  "An alist of (TYPE . FUNCTION) pairs for `spamwash'.
TYPE is a regexp matching a MIME type and FUNCTION is a function
applied to matching MIME parts."
  :type '(alist :key-type regexp :value-type function)
  :group 'spamwash)

(defun spamwash-decode-body (cte ctl)
  "Decode body according to CTE and charset."
  (let ((charset (or (mail-content-type-get ctl 'charset) "ascii")))
    (set-buffer-multibyte t)
    (mm-decode-body charset cte (car ctl))))

(defun spamwash-strip-part (cte ctl)
  "Remove part."
  (delete-region (point-min) (point-max)))

(defun spamwash-treat-multipart (cte ctl)
  "Treat a multipart MIME part."
  (let ((boundary (concat "^--" (regexp-quote
                                 (mail-content-type-get ctl 'boundary))))
        start)
    ;; Each boundary line ends one part and starts the next.
    (while (re-search-forward boundary nil t)
      (forward-line 1)
      (setq start (point))
      (when (re-search-forward boundary nil t)
        (end-of-line 0)
        (save-restriction
          (narrow-to-region start (point))
          (spamwash))))))

(defun spamwash ()
  "Wash current buffer."
  (interactive)
  (let (cte ct ctl match buffer-read-only)
    (save-excursion
      (save-restriction
        (mail-narrow-to-head)
        (setq cte (mail-fetch-field "content-transfer-encoding"))
        (setq ctl (if (setq ct (mail-fetch-field "content-type"))
                      (condition-case nil
                          ;; `rfc2231-parse-string' is a major bottleneck,
                          ;; implement something simpler.
                          (rfc2231-parse-string ct)
                        (error '("text/plain")))
                    '("text/plain"))))
      (setq match (mm-assoc-string-match spamwash-treatment-alist (car ctl)))
      (when match
        (save-restriction
          ;; Narrow to the body: everything after the first blank line.
          (narrow-to-region (progn
                              (goto-char (point-min))
                              (or (search-forward "\n\n" nil t) (point-max)))
                            (point-max))
          (funcall (cdr match) cte ctl))))))

(provide 'spamwash)

;;; spamwash.el ends here

--=-=-=
Content-Type: text/x-patch
Content-Disposition: attachment

*** /home/harder/gnus/lisp/spam-stat.el Mon Jan 5 20:12:20 2004
--- /home/harder/cvsgnus/lisp/spam-stat.el Tue Jan 20 06:54:45 2004
***************
*** 122,127 ****
--- 122,128 ----
  
  ;;; Code:
  
+ (require 'spamwash)
  
  (defgroup spam-stat nil
    "Statistical spam detection for Emacs.
***************
*** 171,176 ****
--- 172,182 ----
    :type 'number
    :group 'spam-stat)
  
+ (defcustom spam-stat-washing-hook '(spamwash)
+   "Hook applied to each message before analysis."
+   :type 'hook
+   :group 'spam-stat)
+ 
  (defvar spam-stat-syntax-table
    (let ((table (copy-syntax-table text-mode-syntax-table)))
      (modify-syntax-entry ?- "w" table)
***************
*** 291,296 ****
--- 297,303 ----
  
  (defun spam-stat-buffer-words ()
    "Return a hash table of words and number of occurences in the buffer."
+   (run-hooks 'spam-stat-washing-hook)
    (with-spam-stat-max-buffer-size
     (with-syntax-table spam-stat-syntax-table
       (goto-char (point-min))
***************
*** 369,395 ****
    "Save the `spam-stat' hash table as lisp file."
    (interactive)
    (when (or force spam-stat-dirty)
!     (with-temp-buffer
!       (let ((standard-output (current-buffer))
!             (font-lock-maximum-size 0))
!         (insert "(setq spam-stat-ngood "
!                 (number-to-string spam-stat-ngood)
!                 " spam-stat-nbad "
!                 (number-to-string spam-stat-nbad)
!                 " spam-stat (spam-stat-to-hash-table '(")
!         (maphash (lambda (word entry)
!                    (prin1 (list word
!                                 (spam-stat-good entry)
!                                 (spam-stat-bad entry))))
!                  spam-stat)
!         (insert ")))")
!         (write-file spam-stat-file)))
      (setq spam-stat-dirty nil)))
  
  (defun spam-stat-load ()
    "Read the `spam-stat' hash table from disk."
    ;; TODO: maybe we should warn the user if spam-stat-dirty is t?
!   (load-file spam-stat-file)
    (setq spam-stat-dirty nil))
  
  (defun spam-stat-to-hash-table (entries)
--- 376,404 ----
    "Save the `spam-stat' hash table as lisp file."
    (interactive)
    (when (or force spam-stat-dirty)
!     (let ((coding-system-for-write 'emacs-mule))
!       (with-temp-file spam-stat-file
!         (let ((standard-output (current-buffer))
!               (font-lock-maximum-size 0))
!           (insert ";-*- coding: emacs-mule; -*-\n")
!           (insert "(setq spam-stat-ngood "
!                   (number-to-string spam-stat-ngood)
!                   " spam-stat-nbad "
!                   (number-to-string spam-stat-nbad)
!                   " spam-stat (spam-stat-to-hash-table '(")
!           (maphash (lambda (word entry)
!                      (prin1 (list word
!                                   (spam-stat-good entry)
!                                   (spam-stat-bad entry))))
!                    spam-stat)
!           (insert ")))"))))
      (setq spam-stat-dirty nil)))
  
  (defun spam-stat-load ()
    "Read the `spam-stat' hash table from disk."
    ;; TODO: maybe we should warn the user if spam-stat-dirty is t?
!   (let ((coding-system-for-read 'emacs-mule))
!     (load-file spam-stat-file))
    (setq spam-stat-dirty nil))
  
  (defun spam-stat-to-hash-table (entries)
***************
*** 399,405 ****
  NBAD is the number of bad mails it has appeared in, GOOD is the number
  of times it appeared in good mails, and BAD is the number of times it
  has appeared in bad mails."
!   (let ((table (make-hash-table :test 'equal)))
      (mapc (lambda (l)
              (puthash (car l)
                       (spam-stat-make-entry (nth 1 l) (nth 2 l))
--- 408,414 ----
  NBAD is the number of bad mails it has appeared in, GOOD is the number
  of times it appeared in good mails, and BAD is the number of times it
  has appeared in bad mails."
!   (let ((table (make-hash-table :size (length entries) :test 'equal)))
      (mapc (lambda (l)
              (puthash (car l)
                       (spam-stat-make-entry (nth 1 l) (nth 2 l))
***************
*** 484,490 ****
                     (> (nth 7 (file-attributes f)) 0))
            (setq count (1+ count))
            (message "Reading %s: %.2f%%" dir (/ count max))
!           (insert-file-contents f)
            (funcall func)
            (erase-buffer))))))
  
--- 493,499 ----
                     (> (nth 7 (file-attributes f)) 0))
            (setq count (1+ count))
            (message "Reading %s: %.2f%%" dir (/ count max))
!           (insert-file-contents-literally f)
            (funcall func)
            (erase-buffer))))))
  
***************
*** 522,528 ****
            (setq count (1+ count))
            (message "Reading %.2f%%, score %.2f%%"
                     (/ count max) (/ score count))
!           (insert-file-contents f)
            (when (> (spam-stat-score-buffer) 0.9)
              (setq score (1+ score)))
            (erase-buffer))))
--- 531,537 ----
            (setq count (1+ count))
            (message "Reading %.2f%%, score %.2f%%"
                     (/ count max) (/ score count))
!           (insert-file-contents-literally f)
            (when (> (spam-stat-score-buffer) 0.9)
              (setq score (1+ score)))
            (erase-buffer))))
--=-=-=--
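
To illustrate why each MIME part has to be decoded separately, here is a
minimal, self-contained sketch that uses only built-in string functions.
It is not the spamwash.el code above: the names `my-decode-part',
`my-decode-parts' and `my-sample-body' are hypothetical, only base64 is
handled, nested multiparts are ignored, and the decoded bytes are
assumed to be UTF-8. The point is only that the base64 payload sits
between plain-text part headers and boundary lines, so it has to be
isolated before it can be decoded.

```elisp
;; Hypothetical helpers for illustration; not part of Gnus or spamwash.el.

(defun my-decode-part (part)
  "Return the body of PART, base64-decoded if its header says so.
PART is one MIME part as a string: header lines, a blank line, body."
  (let* ((split (string-match "\n\n" part))
         (head (substring part 0 split))
         (body (if split (substring part (+ split 2)) "")))
    (if (let ((case-fold-search t))
          (string-match "^content-transfer-encoding:[ \t]*base64" head))
        ;; Assumption: the decoded bytes are UTF-8 text.
        (decode-coding-string (base64-decode-string body) 'utf-8)
      body)))

(defun my-decode-parts (body boundary)
  "Split multipart BODY at BOUNDARY and decode every part on its own."
  (let ((delimiter (concat "^--" (regexp-quote boundary) "\\(?:--\\)?\n?")))
    (mapcar #'my-decode-part (split-string body delimiter t))))

(defvar my-sample-body
  (concat "--XX\n"
          "Content-Type: text/plain\n"
          "\n"
          "hello\n"
          "--XX\n"
          "Content-Type: text/plain\n"
          "Content-Transfer-Encoding: base64\n"
          "\n"
          "YnV5IG5vdw==\n"
          "--XX--\n")
  "A tiny two-part message body used only for this sketch.")

(my-decode-parts my-sample-body "XX")
;; => ("hello\n" "buy now")
```

The second part only becomes readable once it has been cut out of the
surrounding message and decoded according to its own
Content-Transfer-Encoding header, which is the per-part treatment
`spamwash' applies recursively.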