Gnus development mailing list
Subject: Using Eric Raymond's bogofilter tool within Gnus
From: François Pinard @ 2002-09-01  3:16 UTC


Hello, my friends.

Some of you might be aware of the speedy Graham filter written by Eric Raymond
last week.  Here is the recipe I cooked for using it from within Gnus.
Although a bit rough and raw, it might be usable.  Of course, if you improve
on it, please tell me, so I can take advantage of your ideas as well! :-)

---------------------------------------------------------------------->
;;; Gnus against SPAM, training a Graham filter via Bogofilter.
;;; François Pinard, 2002-08.

;;; This Emacs Lisp code adds a few hooks so Oort Gnus can control spam with
;;; the help of Eric Raymond's Bogofilter.  I use Oort Gnus version 0.8 in
;;; development from CVS, see http://quimby.gnus.org/gnus, and Bogofilter 0.4,
;;; see http://www.tuxedo.org/~esr/bogofilter, slightly patched.

;;; The `M-d' command gets added to Gnus summary mode to mark the current
;;; message as spam; this is indicated by the letter `H'.  Whenever you see a
;;; spam message, make sure to mark its summary line with `M-d' before leaving
;;; the group.  Some groups, as per variable `fp-junk-mailgroups' below,
;;; automatically get an `H' mark for all summary lines which would otherwise
;;; have no other mark.  These groups are normally receiving spam resulting
;;; from splitting on clues added by spam recognisers.  Make sure to _remove_
;;; `H' marks for any message which is _not_ genuine spam, before leaving the
;;; group (through `M-u' to "unread" the message, or `d' for declaring it read
;;; the non-spam way).  When you leave a group, all `H' marked messages are
;;; sent to Bogofilter which will study them as spam samples.

;;; Messages may also be deleted in various other ways, and unless
;;; `fp-ham-marks-form' gets overridden below, marks `R' and `r' for default
;;; read or explicit delete, marks `X' and `K' for automatic or explicit
;;; kills, as well as mark `Y' for low scores, are all considered to be
;;; associated with messages which are not spam.  This assumption might be
;;; false, in particular if you use kill files or score files as means for
;;; detecting genuine spam; you should then adjust `fp-ham-marks-form'.  If
;;; you explicitly kill a lot, you might sometimes end up with messages marked
;;; `K' which you never saw, and which might accidentally contain spam.  Best
;;; is to make sure that real spam is marked with `H', and nothing else.  When
;;; you leave a group, all messages with the above marks, besides `H' of
;;; course, are sent to Bogofilter which will study these as not-spam samples.

;;; All other marks do not contribute to Bogofilter pre-conditioning.  In
;;; particular, ticked, dormant or souped articles are likely to contribute
;;; later, when they get deleted for real.  Explicitly expired articles
;;; do not contribute; command `E' is a way to get rid of a message without
;;; Bogofilter ever seeing it.

;;; In short, with a minimum of care in reserving the `H' mark for spam
;;; messages only, Bogofilter training becomes fairly automatic.  You should
;;; do this until you get a few hundred messages in each category, spam
;;; or not.  Use the shell command `head -1 ~/.bogofilter/*' to check message
;;; counts.  Then, the command `S S' in summary mode, either for debugging or
;;; for curiosity, triggers Bogofilter into displaying in another buffer the
;;; "spamicity" score (between 0.0 and 1.0) of the current message, with the
;;; message words which most significantly contributed to the score.

;;; The real way to use Bogofilter, however, is to have some external tool
;;; like `procmail' invoke it on message reception, adding some recognisable
;;; header in case of detected spam.  Gnus splitting rules might later trip
;;; on these added headers and react by sorting such messages into specific
;;; junk folders.
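
;;; As a rough sketch of such a splitting rule, here is a fancy split which
;;; files flagged messages into one of the junk groups.  The header name
;;; `X-Example-Spam', its value, the fallback group "mail.misc" and the
;;; variable name are assumptions made for illustration only; use whatever
;;; header your `procmail' recipe really adds.  To activate it, one would set
;;; `nnmail-split-methods' to `nnmail-split-fancy', and `nnmail-split-fancy'
;;; to this value.

(defvar fp-example-junk-split
  '(| ("X-Example-Spam" "yes" "mail.junk")
      "mail.misc")
  "Example fancy split sending messages flagged as spam to \"mail.junk\".
Hypothetical illustration, not used by the code below.")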

(defvar fp-spaminfo-header-regexp
  "^X-jf:\\|^X-Junk.*:\\|^X-NoSpam:\\|^X-Spammer:\\|^X-SB"
  "Regexp for spam markups in headers.
Markup from spam recognisers, as well as `Xref', are to be removed from
messages before they get registered by Bogofilter.")

(defvar fp-bogofilter-path (executable-find "bogofilter")
  "Path of the Bogofilter program.")

(defvar fp-junk-mailgroups '("mail.junk" "poste.pourriel")
  "Mailgroups which are dedicated by splitting to receive various junk.
All unmarked articles in such a group receive the spam mark on group entry.")

(defvar fp-ham-marks-form
  '(list gnus-del-mark gnus-read-mark gnus-killed-mark
			 gnus-kill-file-mark gnus-low-score-mark)
  "Eval form yielding marks considered as being ham (positively not spam).
Such messages will be transmitted to `bogofilter -n' on group exit.")

(defvar fp-spam-marks-form
  '(list gnus-spam-mark)
  "Eval form yielding marks considered as being spam (positively spam).
Such messages will be transmitted to `bogofilter -s' on group exit.")
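
;;; A minimal sketch of the adjustment suggested in the commentary, for those
;;; whose kill or score files catch genuine spam: a stricter ham form which
;;; leaves out the kill and low-score marks.  The variable name is
;;; hypothetical and nothing below uses it.  To try it, set
;;; `fp-ham-marks-form' to this value before the `eval-after-load' form below
;;; runs, since the form gets evaluated only then.

(defvar fp-example-strict-ham-marks-form
  '(list gnus-del-mark gnus-read-mark)
  "Example eval form counting only read or deleted articles as ham.")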

(defvar fp-ham-marks)
(defvar fp-spam-marks)

(when fp-bogofilter-path
  (eval-after-load "gnus-sum"
    '(progn
       (gnus-define-keys gnus-summary-mode-map
	 "\M-d" gnus-summary-mark-as-spam
	 "SS" fp-gnus-summary-bogofilter-score)
       (add-hook 'gnus-summary-prepare-hook
		 'fp-gnus-summary-mark-junk-as-spam-routine)
       (add-hook 'gnus-summary-prepare-exit-hook
		 'fp-gnus-summary-bogofilter-register-routine)
       (setq fp-ham-marks (eval fp-ham-marks-form)
	     fp-spam-marks (eval fp-spam-marks-form))
       (push '((eq mark gnus-spam-mark) . gnus-splash-face)
	     gnus-summary-highlight))))

(defun fp-gnus-summary-bogofilter-score ()
  "Use `bogofilter -v' on the current message.
This yields the 15 most discriminant words for this message and the
spamicity coefficient of each, and the overall message spamicity."
  (interactive)
  (fp-gnus-bogofilter-articles nil "-v" (list (gnus-summary-article-number))))

(defun fp-gnus-summary-mark-junk-as-spam-routine ()
  "On entry to a junk mailgroup, give the spam mark to all unread articles."
  (when (member gnus-newsgroup-name fp-junk-mailgroups)
    (let ((articles gnus-newsgroup-articles)
	  article)
      (while articles
	(setq article (pop articles))
	(when (eq (gnus-summary-article-mark article) gnus-unread-mark)
	  (gnus-summary-mark-article article gnus-spam-mark))))))

(defun fp-gnus-summary-bogofilter-register-routine ()
  "On group exit, register ham and spam marked articles with Bogofilter."
  (let ((articles gnus-newsgroup-articles)
	article mark ham-articles spam-articles)
    (while articles
      (setq article (pop articles)
	    mark (gnus-summary-article-mark article))
      (cond ((memq mark fp-ham-marks) (push article ham-articles))
	    ((memq mark fp-spam-marks) (push article spam-articles))))
    (when ham-articles
      (fp-gnus-bogofilter-articles "ham" "-n" ham-articles))
    (when spam-articles
      (fp-gnus-bogofilter-articles "spam" "-s" spam-articles))))

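(defun fp-gnus-bogofilter-articles (type options articles)
  "Pipe ARTICLES to Bogofilter, called with the OPTIONS argument.
TYPE, if non-nil, is a string naming the category in progress messages.
Any output from Bogofilter gets displayed once the process terminates."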
  (let* ((output-buffer-name "*Bogofilter Output*")
	 (output-buffer (get-buffer-create output-buffer-name))
	 (article-copy (get-buffer-create " *Bogofilter Article Copy*"))
	 (remove-regexp (concat fp-spaminfo-header-regexp "\\|Xref:"))
	 (prefix (and type (format "Studying %d articles as `%s'..."
				   (length articles) type)))
	 (counter 0)
	 process article)
    (with-current-buffer output-buffer (erase-buffer))
    (setq process (start-process "bogofilter" output-buffer
				 fp-bogofilter-path options))
    (process-kill-without-query process t)
    (save-window-excursion
      (while articles
	(when (and prefix (zerop (% counter 10)))
	  (message "%s %d" prefix (1+ counter)))
	(setq counter (1+ counter))
	(gnus-summary-goto-subject (pop articles))
	(gnus-summary-select-article)
	(gnus-eval-in-buffer-window article-copy
	  (insert-buffer-substring gnus-original-article-buffer)
	  ;; Remove spam classification redundant headers: they may induce
	  ;; unwanted biases in Bayesian analysis.
	  (goto-char (point-min))
	  (while (not (or (eobp) (= (following-char) ?\n)))
	    (if (looking-at remove-regexp)
		(delete-region (point)
			       (save-excursion (forward-line 1) (point)))
	      (forward-line 1)))
	  (goto-char (point-min))
	  ;; Bogofilter really wants From envelopes for counting messages.
	  ;; Fake one at the beginning, make sure there will be no other.
	  (if (looking-at "From ")
	      (forward-line 1)
	    (insert "From nobody " (current-time-string) "\n"))
	  (let (case-fold-search)
	    (while (re-search-forward "^From " nil t)
	      (beginning-of-line)
	      (insert ">")))
	  (process-send-region process (point-min) (point-max))
	  (erase-buffer)))
      (kill-buffer article-copy))
    (when prefix
      (message "%s %d" prefix counter))
    (process-send-eof process)
    (while (memq (process-status process) '(run stop))
      (accept-process-output process))
    (when prefix
      (message "%s done!" prefix))
    (with-current-buffer output-buffer
      (when (= (point-min) (point-max))
	(setq output-buffer nil)))
    (when output-buffer
      (display-message-or-buffer output-buffer output-buffer-name))))
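
;;; A standalone sketch of the envelope handling done in the loop above, for
;;; experimenting outside Gnus.  The function name is hypothetical and not
;;; part of the recipe: it returns TEXT with a fake `From ' envelope line in
;;; front unless one is already there, and with any later line beginning with
;;; `From ' quoted by `>', so that Bogofilter counts a single message.

(defun fp-example-mboxify (text)
  "Return TEXT with one leading `From ' envelope and later ones quoted."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Keep an existing envelope line, otherwise fake one.
    (if (looking-at "From ")
        (forward-line 1)
      (insert "From nobody " (current-time-string) "\n"))
    ;; Quote every remaining line starting with `From ', case sensitively.
    (let (case-fold-search)
      (while (re-search-forward "^From " nil t)
        (beginning-of-line)
        (insert ">")))
    (buffer-string)))

;;; For instance, (fp-example-mboxify "Subject: hi\n\nFrom here on\n") yields
;;; a faked envelope line, the `Subject' header unchanged, and a body line
;;; reading `>From here on'.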
----------------------------------------------------------------------<

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


