From: pinard@iro.umontreal.ca (François Pinard)
Newsgroups: gmane.emacs.gnus.general
Subject: Using Eric Raymond's bogofilter tool within Gnus
Date: Sat, 31 Aug 2002 23:16:35 -0400
Sender: owner-ding@hpc.uh.edu
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/46329
Original-To: Forum of ding/Gnus users

;;; Gnus against SPAM, training a Graham filter via Bogofilter.
;;; François Pinard, 2002-08.

;;; This Emacs Lisp code adds a few hooks so Oort Gnus can control spam with
;;; the help of Eric Raymond's Bogofilter.  I use Oort Gnus version 0.8 in
;;; development from CVS, see http://quimby.gnus.org/gnus, and Bogofilter 0.4,
;;; see http://www.tuxedo.org/~esr/bogofilter, slightly patched.

;;; The `M-d' command gets added to Gnus summary mode to mark the current
;;; message as spam; this is indicated by the letter `H'.  Whenever you see a
;;; spam message, make sure to mark its summary line with `M-d' before
;;; leaving the group.  Some groups, as per variable `fp-junk-mailgroups'
;;; below, automatically get an `H' mark for all summary lines which would
;;; otherwise have no other mark.  These groups normally receive spam
;;; resulting from splitting on clues added by spam recognisers.  Make sure
;;; to _remove_ `H' marks for any message which is _not_ genuine spam before
;;; leaving the group (through `M-u' to "unread" the message, or `d' for
;;; declaring it read the non-spam way).  When you leave a group, all `H'
;;; marked messages are sent to Bogofilter, which will study them as spam
;;; samples.

;;; Messages may also be deleted in various other ways, and unless
;;; `fp-ham-marks-form' gets overridden below, marks `R' and `r' for default
;;; read or explicit delete, marks `X' and `K' for automatic or explicit
;;; kills, as well as mark `Y' for low scores, are all considered to be
;;; associated with messages which are not spam.
;;; This assumption might be false, in particular if you use kill files or
;;; score files as a means of detecting genuine spam; you should then adjust
;;; `fp-ham-marks-form'.  If you explicitly kill a lot, you might sometimes
;;; end up with messages marked `K' which you never saw, and which might
;;; accidentally contain spam.  Best is to make sure that real spam is marked
;;; with `H', and nothing else.  When you leave a group, all messages which
;;; have the above marks, besides `H' of course, are sent to Bogofilter,
;;; which will study them as not-spam samples.

;;; All other marks do not contribute to Bogofilter pre-conditioning.  In
;;; particular, ticked, dormant or souped articles are likely to contribute
;;; later, when they get deleted for real.  Explicitly expired articles do
;;; not contribute: command `E' is a way to get rid of a message without
;;; Bogofilter ever seeing it.

;;; In a word, with a minimum of care for associating the `H' mark with spam
;;; messages only, Bogofilter training becomes fairly automatic.  You should
;;; do this until you get a few hundred messages in both categories, spam
;;; or not.  Use the shell command `head -1 ~/.bogofilter/*' to check message
;;; counts.  Then, the command `S S' in summary mode, either for debugging or
;;; for curiosity, triggers Bogofilter into displaying in another buffer the
;;; "spamicity" score (between 0.0 and 1.0) of the current message, with the
;;; message words which most significantly contributed to the score.

;;; The real way of using Bogofilter, however, is to have some external tool
;;; like `procmail' invoke it on message reception, adding some recognisable
;;; header in case of detected spam.  Gnus splitting rules might later trip
;;; on these added headers and react by sorting such messages into specific
;;; junk folders.

(defvar fp-spaminfo-header-regexp
  "^X-jf:\\|^X-Junk.*:\\|^X-NoSpam:\\|^X-Spammer:\\|^X-SB"
  "Regexp for spam markups in headers.
Markup from spam recognisers, as well as `Xref', are to be removed from
messages before they get registered by Bogofilter.")

(defvar fp-bogofilter-path (executable-find "bogofilter")
  "Path of the Bogofilter program.")

(defvar fp-junk-mailgroups '("mail.junk" "poste.pourriel")
  "Mailgroups which are dedicated by splitting to receive various junk.
All unmarked articles in such groups receive the spam mark on group entry.")

(defvar fp-ham-marks-form
  '(list gnus-del-mark gnus-read-mark gnus-killed-mark
         gnus-kill-file-mark gnus-low-score-mark)
  "Eval form yielding marks considered as being ham (positively not spam).
Such messages will be transmitted to `bogofilter -n' on group exit.")

(defvar fp-spam-marks-form
  '(list gnus-spam-mark)
  "Eval form yielding marks considered as being spam (positively spam).
Such messages will be transmitted to `bogofilter -s' on group exit.")

(defvar fp-ham-marks)
(defvar fp-spam-marks)

(when fp-bogofilter-path
  (eval-after-load "gnus-sum"
    '(progn
       (gnus-define-keys gnus-summary-mode-map
         "\M-d" gnus-summary-mark-as-spam
         "SS" fp-gnus-summary-bogofilter-score)
       (add-hook 'gnus-summary-prepare-hook
                 'fp-gnus-summary-mark-junk-as-spam-routine)
       (add-hook 'gnus-summary-prepare-exit-hook
                 'fp-gnus-summary-bogofilter-register-routine)
       (setq fp-ham-marks (eval fp-ham-marks-form)
             fp-spam-marks (eval fp-spam-marks-form))
       (push '((eq mark gnus-spam-mark) . gnus-splash-face)
             gnus-summary-highlight))))

(defun fp-gnus-summary-bogofilter-score ()
  "Use `bogofilter -v' on the current message.
This yields the 15 most discriminant words for this message and the
spamicity coefficient of each, and the overall message spamicity."
  (interactive)
  (fp-gnus-bogofilter-articles nil "-v" (list (gnus-summary-article-number))))

(defun fp-gnus-summary-mark-junk-as-spam-routine ()
  (when (member gnus-newsgroup-name fp-junk-mailgroups)
    (let ((articles gnus-newsgroup-articles)
          article)
      (while articles
        (setq article (pop articles))
        (when (eq (gnus-summary-article-mark article) gnus-unread-mark)
          (gnus-summary-mark-article article gnus-spam-mark))))))

(defun fp-gnus-summary-bogofilter-register-routine ()
  (let ((articles gnus-newsgroup-articles)
        article mark ham-articles spam-articles)
    (while articles
      (setq article (pop articles)
            mark (gnus-summary-article-mark article))
      (cond ((memq mark fp-ham-marks) (push article ham-articles))
            ((memq mark fp-spam-marks) (push article spam-articles))))
    (when ham-articles
      (fp-gnus-bogofilter-articles "ham" "-n" ham-articles))
    (when spam-articles
      (fp-gnus-bogofilter-articles "spam" "-s" spam-articles))))

(defun fp-gnus-bogofilter-articles (type options articles)
  (let* ((output-buffer-name "*Bogofilter Output*")
         (output-buffer (get-buffer-create output-buffer-name))
         (article-copy (get-buffer-create " *Bogofilter Article Copy*"))
         (remove-regexp (concat fp-spaminfo-header-regexp "\\|Xref:"))
         (prefix (and type (format "Studying %d articles as `%s'..."
                                   (length articles) type)))
         (counter 0)
         process article)
    (save-excursion
      (set-buffer output-buffer)
      (erase-buffer))
    (setq process (start-process "bogofilter" output-buffer
                                 fp-bogofilter-path options))
    (process-kill-without-query process t)
    (save-window-excursion
      (while articles
        (when (and prefix (zerop (% counter 10)))
          (message "%s %d" prefix (1+ counter)))
        (setq counter (1+ counter))
        (gnus-summary-goto-subject (pop articles))
        (gnus-summary-select-article)
        (gnus-eval-in-buffer-window article-copy
          (insert-buffer-substring gnus-original-article-buffer)
          ;; Remove spam classification redundant headers: they may induce
          ;; unwanted biases in Bayesian analysis.
          (goto-char (point-min))
          (while (not (or (eobp) (= (following-char) ?\n)))
            (if (looking-at remove-regexp)
                (delete-region (point)
                               (save-excursion (forward-line 1) (point)))
              (forward-line 1)))
          (goto-char (point-min))
          ;; Bogofilter really wants From envelopes for counting messages.
          ;; Fake one at the beginning, make sure there will be no other.
          (if (looking-at "From ")
              (forward-line 1)
            (insert "From nobody " (current-time-string) "\n"))
          (let (case-fold-search)
            (while (re-search-forward "^From " nil t)
              (beginning-of-line)
              (insert ">")))
          (process-send-region process (point-min) (point-max))
          (erase-buffer)))
      (kill-buffer article-copy))
    (when prefix
      (message "%s %d" prefix counter))
    (process-send-eof process)
    (while (memq (process-status process) '(run stop))
      (accept-process-output process))
    (when prefix
      (message "%s done!" prefix))
    (save-excursion
      (set-buffer output-buffer)
      (when (= (point-min) (point-max))
        (setq output-buffer nil)))
    (when output-buffer
      (display-message-or-buffer output-buffer output-buffer-name))))
----------------------------------------------------------------------<

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard