Gnus development mailing list
* Using Eric Raymond's bogofilter tool within Gnus
@ 2002-09-01  3:16 François Pinard
  2002-09-03  9:05 ` Matthias Andree
  0 siblings, 1 reply; 7+ messages in thread
From: François Pinard @ 2002-09-01  3:16 UTC (permalink / raw)


Hello, my friends.

Some of you might be aware of the speedy Graham filter written by Eric Raymond
last week.  Here is the recipe I cooked for using it from within Gnus.
Though a bit rough, a bit raw, it might be usable.  Of course, if you improve
on it, please tell me, so I can take advantage of your ideas as well! :-)

---------------------------------------------------------------------->
;;; Gnus against SPAM, training a Graham filter via Bogofilter.
;;; François Pinard, 2002-08.

;;; This Emacs Lisp code adds a few hooks so Oort Gnus could control spam with
;;; the help of Eric Raymond's Bogofilter.  I use Oort Gnus version 0.8 in
;;; development from CVS, see http://quimby.gnus.org/gnus, and Bogofilter 0.4,
;;; see http://www.tuxedo.org/~esr/bogofilter, slightly patched.

;;; The `M-d' command gets added to Gnus summary mode to mark the current message
;;; as spam; this is indicated by the letter `H'.  Whenever you see a spam
;;; message, make sure to mark its summary line with `M-d' before leaving the
;;; group.  Some groups, as per variable `fp-junk-mailgroups' below,
;;; automatically get an `H' mark for all summary lines which would otherwise
;;; have no other mark.  These groups are normally receiving spam resulting
;;; from splitting on clues added by spam recognisers.  Make sure to _remove_
;;; `H' marks for any message which is _not_ genuine spam, before leaving the
;;; group (through `M-u' to "unread" the message, or `d' for declaring it read
;;; the non-spam way).  When you leave a group, all `H' marked messages are
;;; sent to Bogofilter which will study them as spam samples.

;;; Messages may also be deleted in various other ways, and unless
;;; `fp-ham-marks-form' gets overridden below, marks `R' and `r' for default
;;; read or explicit delete, marks `X' and `K' for automatic or explicit
;;; kills, as well as mark `Y' for low scores, are all considered to be
;;; associated with messages which are not spam.  This assumption might be
;;; false, in particular if you use kill files or score files as means for
;;; detecting genuine spam, you should then adjust `fp-ham-marks-form'.  If
;;; you explicitly kill a lot, you might sometimes end up with messages marked
;;; `K' which you never saw, and which might accidentally contain spam.  Best
;;; is to make sure that real spam is marked with `H', and nothing else.  When
;;; you leave a group, all messages with the above marks, besides `H' of
;;; course, are sent to Bogofilter which will study these as not-spam samples.

;;; All other marks do not contribute to Bogofilter pre-conditioning.  In
;;; particular, ticked, dormant or souped articles are likely to contribute
;;; later, when they will get deleted for real.  Explicitly expired articles
;;; do not contribute: command `E' is a way to get rid of a message without
;;; Bogofilter ever seeing it.

;;; In a word, with a minimum of care for associating the `H' mark for spam
;;; messages only, Bogofilter training all gets fairly automatic.  You should
;;; do this until you get a few hundred messages in both categories, spam
;;; or not.  Use the shell command `head -1 ~/.bogofilter/*' to check message
;;; counts.  Then, the command `S S' in summary mode, either for debugging or
;;; for curiosity, triggers Bogofilter into displaying in another buffer the
;;; "spamicity" score (between 0.0 and 1.0) of the current message, with the
;;; message words which most significantly contributed to the score.

;;; The real way for using Bogofilter, however, is to have some external tool
;;; like `procmail' for invoking it on message reception, adding some
;;; recognisable header in case of detected spam.  Gnus splitting rules might
;;; later trip on these added headers and react by sorting such messages into
;;; specific junk folders.

(defvar fp-spaminfo-header-regexp
  "^X-jf:\\|^X-Junk.*:\\|^X-NoSpam:\\|^X-Spammer:\\|^X-SB"
  "Regexp for spam markups in headers.
Markup from spam recognisers, as well as `Xref', are to be removed from
messages before they get registered by Bogofilter.")

(defvar fp-bogofilter-path (executable-find "bogofilter")
  "Path of the Bogofilter program.")

(defvar fp-junk-mailgroups '("mail.junk" "poste.pourriel")
  "Mailgroups which are dedicated by splitting to receive various junk.
All unmarked articles in such groups receive the spam mark on group entry.")

(defvar fp-ham-marks-form
  '(list gnus-del-mark gnus-read-mark gnus-killed-mark
			 gnus-kill-file-mark gnus-low-score-mark)
  "Eval form yielding marks considered as being ham (positively not spam).
Such messages will be transmitted to `bogofilter -n' on group exit.")

(defvar fp-spam-marks-form
  '(list gnus-spam-mark)
  "Eval form yielding marks considered as being spam (positively spam).
Such messages will be transmitted to `bogofilter -s' on group exit.")

(defvar fp-ham-marks)
(defvar fp-spam-marks)

(when fp-bogofilter-path
  (eval-after-load "gnus-sum"
    '(progn
       (gnus-define-keys gnus-summary-mode-map
	 "\M-d" gnus-summary-mark-as-spam
	 "SS" fp-gnus-summary-bogofilter-score)
       (add-hook 'gnus-summary-prepare-hook
		 'fp-gnus-summary-mark-junk-as-spam-routine)
       (add-hook 'gnus-summary-prepare-exit-hook
		 'fp-gnus-summary-bogofilter-register-routine)
       (setq fp-ham-marks (eval fp-ham-marks-form)
	     fp-spam-marks (eval fp-spam-marks-form))
       (push '((eq mark gnus-spam-mark) . gnus-splash-face)
	     gnus-summary-highlight))))

(defun fp-gnus-summary-bogofilter-score ()
  "Use `bogofilter -v' on the current message.
This yields the 15 most discriminant words for this message and the
spamicity coefficient of each, and the overall message spamicity."
  (interactive)
  (fp-gnus-bogofilter-articles nil "-v" (list (gnus-summary-article-number))))

(defun fp-gnus-summary-mark-junk-as-spam-routine ()
  "Mark as spam every unread article of a group in `fp-junk-mailgroups'."
  (when (member gnus-newsgroup-name fp-junk-mailgroups)
    (let ((articles gnus-newsgroup-articles)
	  article)
      (while articles
	(setq article (pop articles))
	(when (eq (gnus-summary-article-mark article) gnus-unread-mark)
	  (gnus-summary-mark-article article gnus-spam-mark))))))

(defun fp-gnus-summary-bogofilter-register-routine ()
  "On group exit, send ham-marked and spam-marked articles to Bogofilter."
  (let ((articles gnus-newsgroup-articles)
	article mark ham-articles spam-articles)
    (while articles
      (setq article (pop articles)
	    mark (gnus-summary-article-mark article))
      (cond ((memq mark fp-ham-marks) (push article ham-articles))
	    ((memq mark fp-spam-marks) (push article spam-articles))))
    (when ham-articles
      (fp-gnus-bogofilter-articles "ham" "-n" ham-articles))
    (when spam-articles
      (fp-gnus-bogofilter-articles "spam" "-s" spam-articles))))

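(defun fp-gnus-bogofilter-articles (type options articles)
  "Pipe ARTICLES to a Bogofilter process started with OPTIONS.
TYPE, when non-nil, names the category shown in progress messages."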
  (let* ((output-buffer-name "*Bogofilter Output*")
	 (output-buffer (get-buffer-create output-buffer-name))
	 (article-copy (get-buffer-create " *Bogofilter Article Copy*"))
	 (remove-regexp (concat fp-spaminfo-header-regexp "\\|Xref:"))
	 (prefix (and type (format "Studying %d articles as `%s'..."
				   (length articles) type)))
	 (counter 0)
	 process article)
    (save-excursion (set-buffer output-buffer) (erase-buffer))
    (setq process (start-process "bogofilter" output-buffer
				 fp-bogofilter-path options))
    (process-kill-without-query process t)
    (save-window-excursion
      (while articles
	(when (and prefix (zerop (% counter 10)))
	  (message "%s %d" prefix (1+ counter)))
	(setq counter (1+ counter))
	(gnus-summary-goto-subject (pop articles))
	(gnus-summary-select-article)
	(gnus-eval-in-buffer-window article-copy
	  (insert-buffer-substring gnus-original-article-buffer)
	  ;; Remove spam classification redundant headers: they may induce
	  ;; unwanted biases in Bayesian analysis.
	  (goto-char (point-min))
	  (while (not (or (eobp) (= (following-char) ?\n)))
	    (if (looking-at remove-regexp)
		(delete-region (point)
			       (save-excursion (forward-line 1) (point)))
	      (forward-line 1)))
	  (goto-char (point-min))
	  ;; Bogofilter really wants From envelopes for counting messages.
	  ;; Fake one at the beginning, make sure there will be no other.
	  (if (looking-at "From ")
	      (forward-line 1)
	    (insert "From nobody " (current-time-string) "\n"))
	  (let (case-fold-search)
	    (while (re-search-forward "^From " nil t)
	      (beginning-of-line)
	      (insert ">")))
	  (process-send-region process (point-min) (point-max))
	  (erase-buffer)))
      (kill-buffer article-copy))
    (when prefix
      (message "%s %d" prefix counter))
    (process-send-eof process)
    (while (memq (process-status process) '(run stop))
      (accept-process-output process))
    (when prefix
      (message "%s done!" prefix))
    (save-excursion
      (set-buffer output-buffer)
      (when (= (point-min) (point-max))
	(setq output-buffer nil)))
    (when output-buffer
      (display-message-or-buffer output-buffer output-buffer-name))))
----------------------------------------------------------------------<
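
The splitting step mentioned at the end of the commentary might look like the
sketch below.  It is not part of the recipe above, only an illustration under
assumptions: the header `X-Spam-Flag: YES' is a hypothetical name standing for
whatever your delivery-time rule adds, and `mail.misc' is a placeholder
default group.  `nnmail-split-methods' and `nnmail-split-fancy' are the
standard Gnus variables.

```elisp
;; Assumption: procmail (or a similar tool) ran Bogofilter at delivery time
;; and added the hypothetical header "X-Spam-Flag: YES" to detected spam.
(setq nnmail-split-methods 'nnmail-split-fancy
      nnmail-split-fancy
      '(| ("x-spam-flag" "yes" "mail.junk") ; tagged spam goes to a junk group
          "mail.misc"))                     ; placeholder for everything else
```

With `mail.junk' also listed in `fp-junk-mailgroups', messages sorted this way
would receive the `H' mark on group entry.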

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard




* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-01  3:16 Using Eric Raymond's bogofilter tool within Gnus François Pinard
@ 2002-09-03  9:05 ` Matthias Andree
  2002-09-03 13:46   ` François Pinard
  0 siblings, 1 reply; 7+ messages in thread
From: Matthias Andree @ 2002-09-03  9:05 UTC (permalink / raw)
  Cc: Forum of ding/Gnus users

pinard@iro.umontreal.ca (François Pinard) writes:

> Hello, my friends.
>
> Some of you might be aware of the speedy Graham filter written by Eric Raymond
> last week.  Here is the recipe I cooked for using it from within Gnus.
> Though a bit rough, a bit raw, it might be usable.  Of course, if you improve
> on it, please tell me, so I can take advantage of your ideas as well! :-)

Sorry to be intrusive, but it looks as though "bogofilter" does not
quite work for me, particularly, the -N option does not work (at least
not in 0.6), and I recently got a lot of false positives although I
fed 2,000 non-spam mails to bogofilter -n and only one spam-mail to
bogofilter -s. I sent Eric a patch to fix the -N breakage and another
bug report yesterday, and will have to figure out what else is going
astray. It may not necessarily be bogofilter, it may also be my doing it
the wrong way.

However, there are at least two competing projects that a "Bayesian"
search on freshmeat dug up, but I have not yet had the time to look at
them. From the looks of it, your script could easily also support
spamprobe: it is similar to bogofilter in use, except that it uses
cleartext operation mode specifiers rather than options such as -n or -s
(as bogofilter does).

1. spamprobe  http://sourceforge.net/projects/spamprobe/
              uses GNU gdbm

2. bayespam   http://www.garyarnold.com/projects.php
              seems to use some db as well
              but looks targeted at qmail
=====
3. bogofilter http://www.tuxedo.org/~esr/bogofilter/
              uses HP's LGPL Judy in a persistent daemon
              and plain text files
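
One way the recipe could be made tool-agnostic is a small lookup table from
training action to command-line argument.  This is only a sketch: the
`sketch-' names are hypothetical, bogofilter's `-n', `-s' and `-v' are taken
from François's code, and the spamprobe mode names `good', `spam' and `score'
are an assumption to be checked against its documentation.

```elisp
(defvar sketch-spam-backends
  '((bogofilter . ((ham . "-n") (spam . "-s") (score . "-v")))
    ;; Assumption: spamprobe takes a plain-word mode instead of a flag.
    (spamprobe  . ((ham . "good") (spam . "spam") (score . "score"))))
  "Hypothetical table mapping each filter to its per-action argument.")

(defun sketch-spam-backend-argument (backend action)
  "Return the argument BACKEND expects for ACTION, or nil if unknown.
BACKEND and ACTION are symbols found in `sketch-spam-backends'."
  (cdr (assq action (cdr (assq backend sketch-spam-backends)))))
```

A caller would then pass `(sketch-spam-backend-argument 'spamprobe 'ham)' where
the recipe currently hard-codes "-n".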

-- 
Matthias Andree




* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-03  9:05 ` Matthias Andree
@ 2002-09-03 13:46   ` François Pinard
  2002-09-03 14:13     ` Jeremy H. Brown
  2002-09-11 10:47     ` Matthias Andree
  0 siblings, 2 replies; 7+ messages in thread
From: François Pinard @ 2002-09-03 13:46 UTC (permalink / raw)
  Cc: Forum of ding/Gnus users

[Matthias Andree]

> pinard@iro.umontreal.ca (François Pinard) writes:
>
>> Some of you might be aware of the speedy Graham filter written by Eric
>> Raymond last week.  [...]

> Sorry to be intrusive, but it looks as though "bogofilter" does not
> quite work for me, particularly, the -N option does not work (at least
> not in 0.6),

Give Eric a chance.  The whole project started around two weeks ago, and many
editions brought major overhauls within his code.  Things will stabilise.

For version 0.6, I use `-v', `-n' and `-s' with no serious problems, but
always with `-F' to avoid the split between a client and a server.

> and I recently got a lot of false positives although I fed 2,000 non-spam
> mails to bogofilter -n and only one spam-mail to bogofilter -s.

People taking this seriously train Graham filters in batch, with corpora
holding thousands of messages, both ham and spam.  I'm happy with the results
of on-the-fly training within Gnus, with only a few hundred each of ham and
spam.  I would expect complete nonsense unless you have at the very least a
few dozen messages in each category.
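
A rough illustration of why the counts matter, using the per-word formula from
Paul Graham's "A Plan for Spam" (the first reference listed below).  Whether
Bogofilter 0.6 computes exactly this is an assumption, and the `sketch-' names
are hypothetical.

```elisp
(defun sketch-graham-word-probability (good bad ngood nbad)
  "Return Graham's spam probability for one word, or nil if it is too rare.
GOOD and BAD are the word's occurrence counts in the ham and spam corpora;
NGOOD and NBAD are the numbers of ham and spam messages seen so far."
  (let ((g (* 2 good))                  ; Graham doubles ham counts
        (b bad))
    (when (>= (+ g b) 5)                ; words seen too rarely are ignored
      (let ((gr (min 1.0 (/ (float g) ngood)))
            (br (min 1.0 (/ (float b) nbad))))
        ;; Clamp to [0.01, 0.99] so that no single word is decisive.
        (max 0.01 (min 0.99 (/ br (+ gr br))))))))

(defun sketch-graham-combine (probabilities)
  "Combine per-word PROBABILITIES into an overall message spamicity."
  (let ((spam 1.0) (ham 1.0))
    (dolist (p probabilities)
      (setq spam (* spam p)
            ham (* ham (- 1.0 p))))
    (/ spam (+ spam ham))))
```

With 2,000 ham messages and a single spam on file,
`(sketch-graham-word-probability 2 1 2000 1)' yields 0.99: a word seen twice
in all the ham and once in the lone spam already counts as near-certain spam.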

> However, there are at least two competing projects that a "Bayesian" search
> on freshmeat dug up, but I have not yet had the time to look at them.

If you do, please share your impression with us! :-)

> From the looks of it, your script could easily also support spamprobe: it is
> similar to bogofilter in use, except that it uses cleartext operation mode
> specifiers rather than options such as -n or -s (as bogofilter does).

> 1. spamprobe  http://sourceforge.net/projects/spamprobe/
>               uses GNU gdbm

The maintainer of `spamprobe' wrote (I've been told so, I did not read him
directly) that he was not very satisfied with GNU gdbm performance in this
context, and thought about abandoning this approach.

> 2. bayespam   http://www.garyarnold.com/projects.php
>               [...] but looks targeted at qmail

`qmail'?  Given the choice, I would stay away from Daniel Bernstein's works.
No doubt he is very competent; the problem is not there.  I saw how he relates
to others, and I think those who have to suffer such haughtiness are surely not
free.  Yet, for one, I never had the slightest problem with Daniel so far.  As
my feelings about free software are all mixed and blurred with those of
pleasure, collaboration and friendship, `qmail' is not free software. :-)

Let me thank you for the two references above.  Here are other references I
have on Bayes filtering.  I did not look at the last three.

. @ http://www.paulgraham.com/spam.html
. @ http://www.ai.mit.edu/~jrennie/ifile/
. @ http://www.ai.mit.edu/~jhbrown/ifile-gnus.html
. @ http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/spambayes/
. @ http://research.microsoft.com/~jplatt/cikm98.pdf
. : CRM114 on Sourceforge
. @ http://citeseer.nj.nec.com/blum98combining.html

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard




* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-03 13:46   ` François Pinard
@ 2002-09-03 14:13     ` Jeremy H. Brown
  2002-09-03 15:01       ` Kai Großjohann
  2002-09-11 10:47     ` Matthias Andree
  1 sibling, 1 reply; 7+ messages in thread
From: Jeremy H. Brown @ 2002-09-03 14:13 UTC (permalink / raw)
  Cc: Matthias Andree, Forum of ding/Gnus users

pinard@iro.umontreal.ca (François Pinard) writes:
> Let me thank you for the two references above.  Here are other
> references I have on Bayes filtering.

Let me second those thanks, and thank you as well!  With as many
projects as you've found, it'd be fun to have a running head-to-head
classifier competition.

> . @ http://www.ai.mit.edu/~jrennie/ifile/
> . @ http://www.ai.mit.edu/~jhbrown/ifile-gnus.html

I've been playing with ifile a great deal lately (thus leading to
ifile-gnus). A product review:

The good:

ifile is spectacularly accurate at classifying spam vs. non-spam.  I
love it.  

You can also use ifile as a more general classifier so that it will learn
to file your mail into arbitrary groups; it is reasonably accurate at
that (about 85% according to a paper the ifile author wrote, which
jibes with my experience.)

The bad:

85% accurate classification isn't good enough to use across the board.
(Although it's a reasonable way of splitting mail that your
split-rules missed and would otherwise get dumped in your "misc"
group.)

Performance isn't stellar.  This is mostly because the database is
stored in a large flat text file.  With two classes (spam, non-spam)
it's usable with a DB around 200KB; with more classes, the database
rapidly heads towards a meg or so, and ifile becomes almost unusably
slow due to startup overhead.  (Caveat: I'm running it on a slow
computer, with the database stored in NFS, so I'm pretty much the
worst-case imaginable.  I have specific performance numbers; if anyone
wants them, contact me personally.)


The ugly:

Given a bunch of messages to classify or learn from all at once, ifile
parses them all into memory before moving onto the next step; if your
mailboxes are as out of control as mine, this will cause ifile to run
out of memory and lose.  I don't think there're any fundamental
reasons that this couldn't be fixed.  I have dreams of doing this, and
making ifile run as a daemon to avoid the db-startup overhead.


I'd love to see more reviews of bayesian (or other) spamfilters; it'd
be trivial to mutate ifile-gnus to use just about any command-line
drivable filter.

Jeremy





* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-03 14:13     ` Jeremy H. Brown
@ 2002-09-03 15:01       ` Kai Großjohann
  2002-09-10  0:33         ` Jeremy H. Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Kai Großjohann @ 2002-09-03 15:01 UTC (permalink / raw)
  Cc: François Pinard, Matthias Andree, Forum of ding/Gnus users

jhbrown@ai.mit.edu (Jeremy H. Brown) writes:

> I'd love to see more reviews of bayesian (or other) spamfilters; it'd
> be trivial to mutate ifile-gnus to use just about any command-line
> drivable filter.

Is anyone out there willing to try it on Thorsten Joachims'
svm_light?  The performance of Support Vector Machines appears to be
good both in terms of effectiveness and in terms of efficiency.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)





* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-03 15:01       ` Kai Großjohann
@ 2002-09-10  0:33         ` Jeremy H. Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Jeremy H. Brown @ 2002-09-10  0:33 UTC (permalink / raw)
  Cc: François Pinard, Matthias Andree, Forum of ding/Gnus users

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
> jhbrown@ai.mit.edu (Jeremy H. Brown) writes:
> 
> > I'd love to see more reviews of bayesian (or other) spamfilters; it'd
> > be trivial to mutate ifile-gnus to use just about any command-line
> > drivable filter.
> 
> Is anyone out there willing to try it on Thorsten Joachims'
> svm_light?  The performance of Support Vector Machines appears to be
> good both in terms of effectiveness and in terms of efficiency.

I had a brief talk with a friend of mine who's an SVM expert.  He said
that in practice, the classification is likely to be a little better
with an SVM, but training, especially incremental training, is likely
to be relatively computationally intensive.  Naive Bayes is nice
because it's so computationally simple...

Jeremy






* Re: Using Eric Raymond's bogofilter tool within Gnus
  2002-09-03 13:46   ` François Pinard
  2002-09-03 14:13     ` Jeremy H. Brown
@ 2002-09-11 10:47     ` Matthias Andree
  1 sibling, 0 replies; 7+ messages in thread
From: Matthias Andree @ 2002-09-11 10:47 UTC (permalink / raw)


On Tue, 03 Sep 2002, François Pinard wrote:

> > From the looks of it, your script could easily also support spamprobe: it is
> > similar to bogofilter in use, except that it uses cleartext operation mode
> > specifiers rather than options such as -n or -s (as bogofilter does).
> 
> > 1. spamprobe  http://sourceforge.net/projects/spamprobe/
> >               uses GNU gdbm
> 
> The maintainer of `spamprobe' wrote (I've been told so, I did not read him
> directly) that he was not very satisfied with GNU gdbm performance in this
> context, and thought about abandoning this approach.

I've been using this for some days now, and it's fast and working
alright, some missed positives, and only three false positives in the
beginning, but no more now.

I have killed bogofilter's database and let SpamAssassin train it,
but bogofilter 0.7 keeps flagging most things as spam even when they
aren't. Something must still be wrong with this bogofilter.

> > 2. bayespam   http://www.garyarnold.com/projects.php
> >               [...] but looks targeted at qmail

Still haven't tried this.

> `qmail'?  Given the choice, I would stay away from Daniel Bernstein's works.  No
> doubt that he is very competent, the problem is not there.  I saw him relate

Qmail has some unfixed bugs and incomplete documentation, but there is
no good way of reporting these bugs: doing so leads qmail disciples to a
"talk the error away" or "you don't need that" strategy, and in the end
makes things look quite different from DJB's qmail BLURB. But that's a
different story; I'd only vote against offering special "qmail-inject"
interfaces in Gnus. A qmail bug list is at
http://mandree.home.pages.de/qmail-bugs.html


