From: Ted Zlatanov
Newsgroups: gmane.emacs.gnus.general
Subject: spam autodetection in NNTP and other backends (was: Spam/Ham training)
Date: Wed, 10 Dec 2003 18:05:15 -0500
Message-ID: <4n65gol1ec.fsf_-_@collins.bwh.harvard.edu>
References: <4nhe0kfdt0.fsf@lockgroove.bwh.harvard.edu> <87d6b7oxrm.fsf@everett.mit.edu>
In-Reply-To: <87d6b7oxrm.fsf@everett.mit.edu> (David Z. Maze's message of "Tue, 02 Dec 2003 13:55:09 -0500")
Original-To: David Z Maze
Cc: Xavier Maillard, ding@gnus.org
User-Agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3.50 (usg-unix-v)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=

On Tue, 02 Dec 2003, dmaze@mit.edu wrote:

> Ted Zlatanov writes:
>
>> I am considering adding spam *recognition* when you enter a group,
>> which would be useful for NNTP groups which have no splitting.
>> Unseen articles would be checked against a blacklist, for instance.
>
> Oh, I already have code to do that; I don't use it for nntp, but I
> do use it for another [MIT-local] read-only backend.

David,

thanks for the idea. I went from there and produced the attached
patch. Please look it over. It introduces symbolic returns of 'ham or
'spam from spam-split if requested, uses the registry so it doesn't
recheck cross-posted articles, and generally behaves well enough that
I'm happy with it. If it works for you, I can put it in the CVS
repository.

For Gnus maintainers: this patch produces VERY slow customization
buffers, both for the group/topic parameter spam-autodetect-methods
and for the parameter variable gnus-spam-autodetect-methods. Can
someone please tell me what I'm doing wrong? I may have to break out
the spam.el customizations from the regular `G c' customization buffer
so I'm not slowing down Gnus in general.

Thanks
Ted

--=-=-=
Content-Type: text/x-patch
Content-Disposition: attachment; filename=spam-autodetect.patch

Index: spam.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/spam.el,v
retrieving revision 6.141
diff -u -r6.141 spam.el
--- spam.el 10 Dec 2003 19:57:15 -0000 6.141
+++ spam.el 10 Dec 2003 23:00:52 -0000
@@ -63,6 +63,8 @@
 
 ;; autoload gnus-registry
 (eval-and-compile
+  (autoload 'gnus-registry-group-count "gnus-registry")
+  (autoload 'gnus-registry-add-group "gnus-registry")
   (autoload 'gnus-registry-store-extra-entry "gnus-registry")
   (autoload 'gnus-registry-fetch-extra "gnus-registry"))
 
@@ -98,6 +100,18 @@
   :type 'boolean
   :group 'spam)
 
+(defcustom spam-split-symbolic-return nil
+  "Whether spam-split should work with symbols or group names."
+  :type 'boolean
+  :group 'spam)
+
+(defcustom spam-split-symbolic-return-positive nil
+  "Whether spam-split should ALWAYS work with symbols or group
+  names. Do not set this if you use spam-split in a fancy split
+  method."
+  :type 'boolean
+  :group 'spam)
+
 (defcustom spam-process-ham-in-spam-groups nil
   "Whether ham should be processed in spam groups."
   :type 'boolean
@@ -128,6 +142,12 @@
   :type 'boolean
   :group 'spam)
 
+(defcustom spam-autodetect-recheck-messages nil
+  "Should spam.el recheck all messages when autodetecting?
+Normally this is nil, so only unseen messages will be checked."
+  :type 'boolean
+  :group 'spam)
+
 (defcustom spam-whitelist (expand-file-name "whitelist" spam-directory)
   "The location of the whitelist.
 The file format is one regular expression per line.
@@ -417,6 +437,12 @@
 
 (defvar spam-old-spam-articles nil
   "List of old spam articles, generated when a group is entered.")
+(defvar spam-split-disabled nil
+  "If non-nil, spam-split is disabled, and always returns nil.")
+
+(defvar spam-split-last-successful-check nil
+  "spam-split will set this to nil or a spam-use-XYZ check if it
+  finds ham or spam.")
 
 ;; convenience functions
 (defun spam-xor (a b) ; logical exclusive or
@@ -817,16 +843,18 @@
     (spam-use-hashcash . spam-check-hashcash)
     (spam-use-bogofilter-headers . spam-check-bogofilter-headers)
     (spam-use-bogofilter . spam-check-bogofilter))
-  "The spam-list-of-checks list contains pairs associating a parameter
-variable with a spam checking function. If the parameter variable is
-true, then the checking function is called, and its value decides what
-happens. Each individual check may return nil, t, or a mailgroup
-name. The value nil means that the check does not yield a decision,
-and so, that further checks are needed. The value t means that the
-message is definitely not spam, and that further spam checks should be
-inhibited. Otherwise, a mailgroup name is returned where the mail
-should go, and further checks are also inhibited. The usual mailgroup
-name is the value of `spam-split-group', meaning that the message is
+  "The spam-list-of-checks list contains pairs associating a
+parameter variable with a spam checking function. If the
+parameter variable is true, then the checking function is called,
+and its value decides what happens. Each individual check may
+return nil, t, or a mailgroup name. The value nil means that the
+check does not yield a decision, and so, that further checks are
+needed. The value t means that the message is definitely not
+spam, and that further spam checks should be inhibited.
+Otherwise, a mailgroup name or the symbol 'spam (depending on
+spam-split-symbolic-return) is returned where the mail should go,
+and further checks are also inhibited. The usual mailgroup name
+is the value of `spam-split-group', meaning that the message is
 definitely a spam.")
 
 (defvar spam-list-of-statistical-checks
@@ -838,9 +866,6 @@
   "The spam-list-of-statistical-checks list contains all the mail
 splitters that need to have the full message body available.")
 
-(defvar spam-split-disabled nil
-  "If non-nil, spam-split is disabled, and always returns nil.")
-
 ;;;TODO: modify to invoke self with each check if invoked without specifics
 (defun spam-split (&rest specific-checks)
   "Split this message into the `spam' group if it is spam.
@@ -851,6 +876,7 @@
 
 See the Info node `(gnus)Fancy Mail Splitting' for more details."
   (interactive)
+  (setq spam-split-last-successful-check nil)
   (unless spam-split-disabled
     (let ((spam-split-group-choice spam-split-group))
       (dolist (check specific-checks)
@@ -877,11 +903,69 @@
                              (memq (car pair) specific-checks)))
                 (gnus-message 5 "spam-split: calling the %s function"
                               (symbol-name (cdr pair)))
-                (setq decision (funcall (cdr pair))))))
+                (setq decision (funcall (cdr pair)))
+                ;; if we got a decision at all, save the current check
+                (when decision
+                  (setq spam-split-last-successful-check (car pair)))
+
+                (when (eq decision 'spam)
+                  (if spam-split-symbolic-return
+                      (setq decision spam-split-group)
+                    (gnus-error
+                     5
+                     (format "spam-split got %s but %s is nil"
+                             (symbol-name decision)
+                             (symbol-name spam-split-symbolic-return))))))))
             (if (eq decision t)
-                nil
+                (if spam-split-symbolic-return-positive 'ham nil)
               decision))))))))
 
+(defun spam-find-spam ()
+  "This function will detect spam in the current newsgroup using spam-split"
+  (interactive)
+
+  (let* ((group gnus-newsgroup-name)
+         (autodetect (gnus-parameter-spam-autodetect group))
+         (methods (gnus-parameter-spam-autodetect-methods group))
+         (first-method (nth 0 methods)))
+    (when (and autodetect
+               (not (equal first-method 'none)))
+      (mapcar
+       (lambda (article)
+         (let ((id (spam-fetch-field-message-id-fast article))
+               (subject (spam-fetch-field-subject-fast article))
+               (sender (spam-fetch-field-from-fast article)))
+           (unless (and spam-log-to-registry
+                        (spam-log-registered-p id 'incoming))
+             (let* ((spam-split-symbolic-return t)
+                    (spam-split-symbolic-return-positive t)
+                    (split-return
+                     (with-temp-buffer
+                       (gnus-request-article-this-buffer
+                        article
+                        group)
+                       (if (or (null first-method)
+                               (equal first-method 'default))
+                           (spam-split)
+                         (apply 'spam-split methods)))))
+               (if (equal split-return 'spam)
+                   (gnus-summary-mark-article article gnus-spam-mark))
+
+               (when (and split-return spam-log-to-registry)
+                 (when (zerop (gnus-registry-group-count id))
+                   (gnus-registry-add-group
+                    id group subject sender))
+
+                 (spam-log-processing-to-registry
+                  id
+                  'incoming
+                  split-return
+                  spam-split-last-successful-check
+                  group))))))
+       (if spam-autodetect-recheck-messages
+           gnus-newsgroup-articles
+         gnus-newsgroup-unseen)))))
+
 (defvar spam-registration-functions
   ;; first the ham register, second the spam register function
   ;; third the ham unregister, fourth the spam unregister function
@@ -1022,6 +1106,17 @@
       (gnus-message 5 (format "%s called with bad ID, type, classification, check, or group"
                               "spam-log-processing-to-registry")))))
 
+;;; check if a ham- or spam-processor registration has been done
+(defun spam-log-registered-p (id type)
+  (when spam-log-to-registry
+    (if (and (stringp id)
+             (spam-process-type-valid-p type))
+        (cdr-safe (gnus-registry-fetch-extra id type))
+      (progn
+        (gnus-message 5 (format "%s called with bad ID, type, classification, or check"
+                                "spam-log-registered-p"))
+        nil))))
+
 ;;; check if a ham- or spam-processor registration needs to be undone
 (defun spam-log-unregistration-needed-p (id type classification check)
   (when spam-log-to-registry
@@ -1085,6 +1180,9 @@
 
 (defun spam-check-regex-headers (&optional body)
   (let ((type (if body "body" "header"))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group))
         ret found)
     (dolist (h-regex spam-regex-headers-ham)
       (unless found
@@ -1113,6 +1211,9 @@
 (defun spam-check-blackholes ()
   "Check the Received headers for blackholed relays."
   (let ((headers (nnmail-fetch-field "received"))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group))
         ips matches)
     (when headers
       (with-temp-buffer
@@ -1209,7 +1310,10 @@
 
 (defun spam-check-BBDB ()
   "Mail from people in the BBDB is classified as ham or non-spam"
-  (let ((who (nnmail-fetch-field "from")))
+  (let ((who (nnmail-fetch-field "from"))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group)))
     (when who
       (setq who (nth 1 (gnus-extract-address-components who)))
       (if (bbdb-search-simple nil who)
@@ -1243,6 +1347,9 @@
 (defun spam-check-ifile ()
   "Check the ifile backend for the classification of this message"
   (let ((article-buffer-name (buffer-name))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group))
         category return)
     (with-temp-buffer
       (let ((temp-buffer-name (buffer-name))
@@ -1305,7 +1412,10 @@
 
 (defun spam-check-stat ()
   "Check the spam-stat backend for the classification of this message"
-  (let ((spam-stat-split-fancy-spam-group spam-split-group) ; override
+  (let ((spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group))
+        (spam-stat-split-fancy-spam-group spam-split-group) ; override
         (spam-stat-buffer (buffer-name)) ; stat the current buffer
         category return)
     (spam-stat-split-fancy)))
@@ -1415,19 +1525,25 @@
 ;;; spam-split-group otherwise
 (defun spam-check-whitelist ()
   ;; FIXME! Should it detect when file timestamps change?
-  (unless spam-whitelist-cache
-    (setq spam-whitelist-cache (spam-parse-list spam-whitelist)))
-  (if (spam-from-listed-p spam-whitelist-cache)
-      t
-    (if spam-use-whitelist-exclusive
-        spam-split-group
-      nil)))
+  (let ((spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group)))
+    (unless spam-whitelist-cache
+      (setq spam-whitelist-cache (spam-parse-list spam-whitelist)))
+    (if (spam-from-listed-p spam-whitelist-cache)
+        t
+      (if spam-use-whitelist-exclusive
+          spam-split-group
+        nil))))
 
 (defun spam-check-blacklist ()
   ;; FIXME! Should it detect when file timestamps change?
-  (unless spam-blacklist-cache
-    (setq spam-blacklist-cache (spam-parse-list spam-blacklist)))
-  (and (spam-from-listed-p spam-blacklist-cache) spam-split-group))
+  (let ((spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group)))
+    (unless spam-blacklist-cache
+      (setq spam-blacklist-cache (spam-parse-list spam-blacklist)))
+    (and (spam-from-listed-p spam-blacklist-cache) spam-split-group)))
 
 (defun spam-parse-list (file)
   (when (file-readable-p file)
@@ -1518,7 +1634,10 @@
 
 ;;;; Bogofilter
 (defun spam-check-bogofilter-headers (&optional score)
-  (let ((header (nnmail-fetch-field spam-bogofilter-header)))
+  (let ((header (nnmail-fetch-field spam-bogofilter-header))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group)))
     (when header ; return nil when no header
       (if score ; scoring mode
           (if (string-match "spamicity=\\([0-9.]+\\)" header)
@@ -1600,7 +1719,10 @@
 ;;;; spamoracle
 (defun spam-check-spamoracle ()
   "Run spamoracle on an article to determine whether it's spam."
-  (let ((article-buffer-name (buffer-name)))
+  (let ((article-buffer-name (buffer-name))
+        (spam-split-group (if spam-split-symbolic-return
+                              'spam
+                            spam-split-group)))
     (with-temp-buffer
       (let ((temp-buffer-name (buffer-name)))
         (save-excursion
@@ -1673,7 +1795,8 @@
   (add-hook 'gnus-startup-hook 'spam-maybe-spam-stat-load)
   (add-hook 'gnus-summary-prepare-exit-hook 'spam-summary-prepare-exit)
   (add-hook 'gnus-summary-prepare-hook 'spam-summary-prepare)
-  (add-hook 'gnus-get-new-news-hook 'spam-setup-widening))
+  (add-hook 'gnus-get-new-news-hook 'spam-setup-widening)
+  (add-hook 'gnus-summary-prepare-hook 'spam-find-spam))
 
 (defun spam-unload-hook ()
   "Uninstall the spam.el hooks"
@@ -1683,7 +1806,8 @@
   (remove-hook 'gnus-startup-hook 'spam-maybe-spam-stat-load)
   (remove-hook 'gnus-summary-prepare-exit-hook 'spam-summary-prepare-exit)
   (remove-hook 'gnus-summary-prepare-hook 'spam-summary-prepare)
-  (remove-hook 'gnus-get-new-news-hook 'spam-setup-widening))
+  (remove-hook 'gnus-get-new-news-hook 'spam-setup-widening)
+  (remove-hook 'gnus-summary-prepare-hook 'spam-find-spam))
 
 (when spam-install-hooks
   (spam-initialize))
Index: gnus.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/gnus.el,v
retrieving revision 6.210
diff -u -r6.210 gnus.el
--- gnus.el 30 Nov 2003 04:50:28 -0000 6.210
+++ gnus.el 10 Dec 2003 23:00:53 -0000
@@ -1960,6 +1960,88 @@
  "Which spam or ham processors will be applied when the summary is exited.")
 
 (gnus-define-group-parameter
+ spam-autodetect
+ :type list
+ :parameter-type
+ 'boolean
+ :function-document
+ "Should spam be autodetected (with spam-split) in this group?"
+ :variable gnus-spam-autodetect
+ :variable-default nil
+ :variable-document
+ "*Groups in which spam should be autodetected when they are entered.
+ Only unseen articles will be examined, unless
+ spam-autodetect-recheck-messages is set."
+ :variable-group spam
+ :variable-type
+ '(repeat
+   :tag "Autodetection setting"
+   (list
+    (regexp :tag "Group Regexp")
+    boolean))
+ :parameter-document
+ "Spam autodetection.
+Only unseen articles will be examined, unless
+spam-autodetect-recheck-messages is set.")
+
+ (gnus-define-group-parameter
+ spam-autodetect-methods
+ :type list
+ :parameter-type
+ '(choice
+   (const none)
+   (const default)
+   (set :tag "Use specific methods"
+        (variable-item spam-use-blacklist)
+        (variable-item spam-use-regex-headers)
+        (variable-item spam-use-regex-body)
+        (variable-item spam-use-whitelist)
+        (variable-item spam-use-BBDB)
+        (variable-item spam-use-ifile)
+        (variable-item spam-use-spamoracle)
+        (variable-item spam-use-stat)
+        (variable-item spam-use-blackholes)
+        (variable-item spam-use-hashcash)
+        (variable-item spam-use-bogofilter-headers)
+        (variable-item spam-use-bogofilter)))
+ :function-document
+ "Methods to be used for autodetection in each group"
+ :variable gnus-spam-autodetect-methods
+ :variable-default nil
+ :variable-document
+ "*Methods for autodetecting spam per group.
+Requires the spam-autodetect parameter. Only unseen articles
+will be examined, unless spam-autodetect-recheck-messages is
+set."
+ :variable-group spam
+ :variable-type
+ '(repeat
+   :tag "Autodetection methods"
+   (list
+    (regexp :tag "Group Regexp")
+    (choice
+     (const none)
+     (const default)
+     (set :tag "Use specific methods"
+          (variable-item spam-use-blacklist)
+          (variable-item spam-use-regex-headers)
+          (variable-item spam-use-regex-body)
+          (variable-item spam-use-whitelist)
+          (variable-item spam-use-BBDB)
+          (variable-item spam-use-ifile)
+          (variable-item spam-use-spamoracle)
+          (variable-item spam-use-stat)
+          (variable-item spam-use-blackholes)
+          (variable-item spam-use-hashcash)
+          (variable-item spam-use-bogofilter-headers)
+          (variable-item spam-use-bogofilter)))))
+ :parameter-document
+ "Spam autodetection methods.
+Requires the spam-autodetect parameter. Only unseen articles
+will be examined, unless spam-autodetect-recheck-messages is
+set.")
+
+ (gnus-define-group-parameter
  spam-process-destination
  :type list
  :parameter-type
--=-=-=--
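
For anyone wanting to try the patch, the new variables might be set along the following lines. This is only a sketch and has not been tested: it assumes the patch is applied, the group regexp is a made-up example, and the value shapes simply follow the :variable-type declarations in the gnus.el hunk. Going by the spam-list-of-checks docstring, a check only runs when its spam-use-* variable is true, so the sketch turns those on as well.

```elisp
;; Sketch only: assumes the spam-autodetect patch has been applied.
;; "^nntp\\+news\\.example\\.org:" is a hypothetical group pattern.

;; Each entry is (GROUP-REGEXP BOOLEAN), per the patch's :variable-type.
(setq gnus-spam-autodetect
      '(("^nntp\\+news\\.example\\.org:" t)))

;; Each entry is (GROUP-REGEXP METHODS); METHODS is `none', `default',
;; or a list of spam-use-* symbols, per the patch's :variable-type.
(setq gnus-spam-autodetect-methods
      '(("^nntp\\+news\\.example\\.org:"
         (spam-use-blacklist spam-use-regex-headers))))

;; Assumption: the listed checks are consulted only if their
;; parameter variables are non-nil (see spam-list-of-checks).
(setq spam-use-blacklist t
      spam-use-regex-headers t)

;; Leave at nil (the patch's default) to examine only unseen articles.
(setq spam-autodetect-recheck-messages nil)
```

With settings like these, spam-find-spam would run from gnus-summary-prepare-hook on entering a matching group and mark detected articles with gnus-spam-mark.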