From: pinard@iro.umontreal.ca (François Pinard)
Subject: Using Eric Raymond's bogofilter tool within Gnus
Date: Sat, 31 Aug 2002 23:16:35 -0400
Message-ID: <oqhehaqu0c.fsf@titan.progiciels-bpi.ca>
Hello, my friends.
Some of you might be aware of the speedy Graham filter written by Eric Raymond
last week. Here is the recipe I cooked for using it from within Gnus.
Although a bit rough and raw, it might be usable. Of course, if you improve
on it, please tell me, so I can take advantage of your ideas as well! :-)
---------------------------------------------------------------------->
;;; Gnus against SPAM, training a Graham filter via Bogofilter.
;;; François Pinard, 2002-08.
;;; This Emacs Lisp code adds a few hooks so Oort Gnus could control spam with
;;; the help of Eric Raymond's Bogofilter. I use Oort Gnus version 0.8 in
;;; development from CVS, see http://quimby.gnus.org/gnus, and Bogofilter 0.4,
;;; see http://www.tuxedo.org/~esr/bogofilter, slightly patched.
;;; The `M-d' command gets added to Gnus summary mode to mark the current
;;; message as spam; this is indicated by the letter `H'. Whenever you see a
;;; spam message, make sure to mark its summary line with `M-d' before
;;; leaving the group. Some groups, as per variable `fp-junk-mailgroups' below,
;;; automatically get an `H' mark for all summary lines which would otherwise
;;; have no other mark. These groups are normally receiving spam resulting
;;; from splitting on clues added by spam recognisers. Make sure to _remove_
;;; `H' marks for any message which is _not_ genuine spam, before leaving the
;;; group (through `M-u' to "unread" the message, or `d' for declaring it read
;;; the non-spam way). When you leave a group, all `H' marked messages are
;;; sent to Bogofilter which will study them as spam samples.
;;; Messages may also be deleted in various other ways, and unless
;;; `fp-ham-marks-form' gets overridden below, marks `R' and `r' for default
;;; read or explicit delete, marks `X' and `K' for automatic or explicit
;;; kills, as well as mark `Y' for low scores, are all considered to be
;;; associated with messages which are not spam. This assumption might be
;;; false; in particular, if you use kill files or score files as a means of
;;; detecting genuine spam, you should then adjust `fp-ham-marks-form'. If
;;; you explicitly kill a lot, you might sometimes end up with messages marked
;;; `K' which you never saw, and which might accidentally contain spam. Best
;;; is to make sure that real spam is marked with `H', and nothing else. When
;;; you leave a group, all messages with the above marks, besides `H' of
;;; course, are sent to Bogofilter which will study these as not-spam samples.
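;;; As a sketch only, assuming kill files and score files are what catches
;;; your spam, a setting like the following would keep `X', `K' and `Y'
;;; marked messages out of the not-spam samples. It has to be in effect
;;; before the hooks get installed at `gnus-sum' load time, as the form is
;;; evaluated only once:
;;;
;;;     (setq fp-ham-marks-form '(list gnus-del-mark gnus-read-mark))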
;;; All other marks do not contribute to Bogofilter pre-conditioning. In
;;; particular, ticked, dormant or souped articles are likely to contribute
;;; later, when they will get deleted for real. Explicitly expired articles
;;; do not contribute; command `E' is a way to get rid of a message without
;;; Bogofilter ever seeing it.
;;; In a word, with a minimum of care for associating the `H' mark with spam
;;; messages only, Bogofilter training gets fairly automatic. You should
;;; do this until you get a few hundred messages in both categories, spam
;;; or not. Use the shell command `head -1 ~/.bogofilter/*' to check message
;;; counts. Then, the command `S S' in summary mode, either for debugging or
;;; for curiosity, triggers Bogofilter into displaying in another buffer the
;;; "spamicity" score (between 0.0 and 1.0) of the current message, with the
;;; message words which most significantly contributed to the score.
;;; The real way for using Bogofilter, however, is to have some external tool
;;; like `procmail' for invoking it on message reception, adding some
;;; recognisable header in case of detected spam. Gnus splitting rules might
;;; later trip on these added headers and react by sorting such messages into
;;; specific junk folders.
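;;; On the Gnus side, such a splitting rule might look like the following
;;; sketch. The `X-Spammer' header and the `mail.misc' catch-all group are
;;; only assumptions here; use whatever header your own tool adds:
;;;
;;;     (setq nnmail-split-methods
;;;           '(("mail.junk" "^X-Spammer:")
;;;             ("mail.misc" "")))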
(defvar fp-spaminfo-header-regexp
  "^X-jf:\\|^X-Junk.*:\\|^X-NoSpam:\\|^X-Spammer:\\|^X-SB"
  "Regexp for spam markups in headers.
Markup from spam recognisers, as well as `Xref', are to be removed from
messages before they get registered by Bogofilter.")
(defvar fp-bogofilter-path (executable-find "bogofilter")
  "Path of the Bogofilter program.")
(defvar fp-junk-mailgroups '("mail.junk" "poste.pourriel")
  "Mailgroups which are dedicated by splitting to receive various junk.
All unmarked articles in such groups receive the spam mark on group entry.")
(defvar fp-ham-marks-form
  '(list gnus-del-mark gnus-read-mark gnus-killed-mark
         gnus-kill-file-mark gnus-low-score-mark)
  "Eval form yielding marks considered as being ham (positively not spam).
Such messages will be transmitted to `bogofilter -n' on group exit.")
(defvar fp-spam-marks-form
  '(list gnus-spam-mark)
  "Eval form yielding marks considered as being spam (positively spam).
Such messages will be transmitted to `bogofilter -s' on group exit.")
(defvar fp-ham-marks)
(defvar fp-spam-marks)
(when fp-bogofilter-path
  (eval-after-load "gnus-sum"
    '(progn
       (gnus-define-keys gnus-summary-mode-map
         "\M-d" gnus-summary-mark-as-spam
         "SS" fp-gnus-summary-bogofilter-score)
       (add-hook 'gnus-summary-prepare-hook
                 'fp-gnus-summary-mark-junk-as-spam-routine)
       (add-hook 'gnus-summary-prepare-exit-hook
                 'fp-gnus-summary-bogofilter-register-routine)
       (setq fp-ham-marks (eval fp-ham-marks-form)
             fp-spam-marks (eval fp-spam-marks-form))
       (push '((eq mark gnus-spam-mark) . gnus-splash-face)
             gnus-summary-highlight))))
(defun fp-gnus-summary-bogofilter-score ()
  "Use `bogofilter -v' on the current message.
This yields the 15 most discriminant words for this message and the
spamicity coefficient of each, and the overall message spamicity."
  (interactive)
  (fp-gnus-bogofilter-articles nil "-v" (list (gnus-summary-article-number))))
(defun fp-gnus-summary-mark-junk-as-spam-routine ()
  "Mark all unread articles as spam in groups from `fp-junk-mailgroups'."
  (when (member gnus-newsgroup-name fp-junk-mailgroups)
    (let ((articles gnus-newsgroup-articles)
          article)
      (while articles
        (setq article (pop articles))
        (when (eq (gnus-summary-article-mark article) gnus-unread-mark)
          (gnus-summary-mark-article article gnus-spam-mark))))))
(defun fp-gnus-summary-bogofilter-register-routine ()
  "Send ham and spam articles of the current group to Bogofilter.
Articles are sorted by their marks, per `fp-ham-marks' and `fp-spam-marks'."
  (let ((articles gnus-newsgroup-articles)
        article mark ham-articles spam-articles)
    (while articles
      (setq article (pop articles)
            mark (gnus-summary-article-mark article))
      (cond ((memq mark fp-ham-marks) (push article ham-articles))
            ((memq mark fp-spam-marks) (push article spam-articles))))
    (when ham-articles
      (fp-gnus-bogofilter-articles "ham" "-n" ham-articles))
    (when spam-articles
      (fp-gnus-bogofilter-articles "spam" "-s" spam-articles))))
(defun fp-gnus-bogofilter-articles (type options articles)
  "Send ARTICLES to a Bogofilter process started with OPTIONS.
TYPE, when not nil, names the category in progress messages.
Any output from Bogofilter gets displayed once the process ends."
  (let* ((output-buffer-name "*Bogofilter Output*")
         (output-buffer (get-buffer-create output-buffer-name))
         (article-copy (get-buffer-create " *Bogofilter Article Copy*"))
         (remove-regexp (concat fp-spaminfo-header-regexp "\\|Xref:"))
         (prefix (and type (format "Studying %d articles as `%s'..."
                                   (length articles) type)))
         (counter 0)
         process article)
    (save-excursion (set-buffer output-buffer) (erase-buffer))
    (setq process (start-process "bogofilter" output-buffer
                                 fp-bogofilter-path options))
    (process-kill-without-query process t)
    (save-window-excursion
      (while articles
        (when (and prefix (zerop (% counter 10)))
          (message "%s %d" prefix (1+ counter)))
        (setq counter (1+ counter))
        (gnus-summary-goto-subject (pop articles))
        (gnus-summary-select-article)
        (gnus-eval-in-buffer-window article-copy
          (insert-buffer-substring gnus-original-article-buffer)
          ;; Remove spam classification redundant headers: they may induce
          ;; unwanted biases in Bayesian analysis.
          (goto-char (point-min))
          (while (not (or (eobp) (= (following-char) ?\n)))
            (if (looking-at remove-regexp)
                (delete-region (point)
                               (save-excursion (forward-line 1) (point)))
              (forward-line 1)))
          (goto-char (point-min))
          ;; Bogofilter really wants From envelopes for counting messages.
          ;; Fake one at the beginning, make sure there will be no other.
          (if (looking-at "From ")
              (forward-line 1)
            (insert "From nobody " (current-time-string) "\n"))
          (let (case-fold-search)
            (while (re-search-forward "^From " nil t)
              (beginning-of-line)
              (insert ">")))
          (process-send-region process (point-min) (point-max))
          (erase-buffer)))
      (kill-buffer article-copy))
    (when prefix
      (message "%s %d" prefix counter))
    (process-send-eof process)
    (while (memq (process-status process) '(run stop))
      (accept-process-output process))
    (when prefix
      (message "%s done!" prefix))
    (save-excursion
      (set-buffer output-buffer)
      (when (= (point-min) (point-max))
        (setq output-buffer nil)))
    (when output-buffer
      (display-message-or-buffer output-buffer output-buffer-name))))
----------------------------------------------------------------------<
--
François Pinard http://www.iro.umontreal.ca/~pinard