Re: Bogofilter

Announcements and discussions for Gnus, the GNU Emacs Usenet newsreader

* Re: Bogofilter
       [not found]     ` <m2r8ayh7zm.fsf_-_@bluesteel.grierwhite.com>
@ 2003-01-27 19:01       ` David Z Maze
       [not found]         ` <9cfvg09rx37.fsf@rogue.ncsl.nist.gov>
  0 siblings, 1 reply; 10+ messages in thread
From: David Z Maze @ 2003-01-27 19:01 UTC (permalink / raw)


chris@grierwhite.com (Christopher J. White) writes:
> I'm intrigued...my spam load has reached critical levels, so I'm
> interested in trying out some spam filtering techniques.  Are you
> using bogofilter with gnus?  If so, what's your experience, does it
> ever filter out "good" email as spam (as it's statistical in nature)?
> If you don't have several hundred spam messages around to train it,
> how else do you train it, or do I have to wait til I get enough saved
> up spam?

I use ifile, not bogofilter, but the two packages seem to be similar
in nature.  Yes, I get some false positives with ifile.  It's also not
like I didn't have a big pile of spam sitting around before I started
using ifile (I hand-sorted it into its own group); if nothing else,
you can set up a Hotmail account and wait for a couple of weeks.  :-)

> Finally, how do you integrate bogofilter with gnus for mail reading
> (read via POP3, stored as nnml).

You're using

> User-Agent: Gnus/5.090014 (Oort Gnus v0.14) Emacs/21.2 (powerpc-apple-darwin)

so you can just use the functions in spam.el.  My .gnus file has

(setq gnus-spam-newsgroup-contents
      '(("nnml:mail.misc.spam" gnus-group-spam-classification-spam)
        ("nnml:.*" gnus-group-spam-classification-ham))
      gnus-spam-process-newsgroups
      '(("nnml:.*" (gnus-group-spam-exit-processor-ifile
                    gnus-group-ham-exit-processor-ifile)))
      gnus-spam-process-destinations
      '(("nnml:.*" "nnml:mail.misc.spam"))
      spam-junk-mailgroups '("mail.misc.spam")
      spam-split-group "mail.misc.spam"
      spam-use-ifile t)

and I call (: spam-split) in nnmail-split-fancy.  I think all of these
should Just Work for you if you change group names appropriately and
substitute bogofilter for ifile.
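
For illustration, here is roughly what that substitution might look like.
This is an untested sketch: the bogofilter exit-processor symbols and
spam-use-bogofilter are the names spam.el documents as far as I know, but
check them against your Gnus version. The catch-all group "mail.misc" and
the placement of (: spam-split) at the head of the fancy split are
assumptions, not part of the setup above.

```elisp
;; Sketch only: the ifile settings above with bogofilter swapped in.
;; Group names are the ones from the original; adjust to your setup.
(setq gnus-spam-newsgroup-contents
      '(("nnml:mail.misc.spam" gnus-group-spam-classification-spam)
        ("nnml:.*" gnus-group-spam-classification-ham))
      gnus-spam-process-newsgroups
      '(("nnml:.*" (gnus-group-spam-exit-processor-bogofilter
                    gnus-group-ham-exit-processor-bogofilter)))
      gnus-spam-process-destinations
      '(("nnml:.*" "nnml:mail.misc.spam"))
      spam-junk-mailgroups '("mail.misc.spam")
      spam-split-group "mail.misc.spam"
      spam-use-bogofilter t)

;; Assumed placement: spam-split runs first, and anything it does not
;; claim falls through to a catch-all group ("mail.misc" is made up).
(setq nnmail-split-methods 'nnmail-split-fancy
      nnmail-split-fancy '(| (: spam-split)
                             "mail.misc"))
```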

-- 
David Maze             dmaze@mit.edu          http://www.mit.edu/~dmaze/
"Theoretical politics is interesting.  Politicking should be illegal."
	-- Abra Mitchell



* Re: Bogofilter
       [not found]         ` <9cfvg09rx37.fsf@rogue.ncsl.nist.gov>
@ 2003-01-29  7:10           ` Kai Großjohann
  2003-01-29 19:59             ` Bogofilter Ian Soboroff
  0 siblings, 1 reply; 10+ messages in thread
From: Kai Großjohann @ 2003-01-29  7:10 UTC (permalink / raw)


Ian Soboroff <org@acm.isoboroff> writes:

> It's nice to see such a craze over naive-Bayes filtering techniques,
> but they can get overtrained pretty easily.

Yeah.  I don't know much about automatic classification, but I seem
to recall that naive-Bayes is not the most effective method.

So are there better algorithms around and is there an implementation
that can be integrated into Gnus, similar to ifile?
-- 
Ambibibentists unite!



* Re: spam assassin filtering
       [not found] ` <87ptqiodvb.fsf@unix.home>
@ 2003-01-29 11:06   ` Alain Picard
  2003-01-29 15:35     ` Michael Below
  0 siblings, 1 reply; 10+ messages in thread
From: Alain Picard @ 2003-01-29 11:06 UTC (permalink / raw)


deskpot@despammed.com (Vasily Korytov) writes:

> Yep, it works here. It's very simple here: I have procmail as MDA and I
> call spamc (spamd is run at startup) from my ~/.procmailrc. Then I have
> ("junk.spam" "^X-Spam-Status: Yes") entry in my nnmail-split-methods.

I was hoping for a procmail-free solution, as this is on a laptop
system, and I prefer to get the mail "on demand", rather than
from a procmail daemon.

But thanks for the tip, I may have to use it nonetheless.



* Re: spam assassin filtering
  2003-01-29 11:06   ` spam assassin filtering Alain Picard
@ 2003-01-29 15:35     ` Michael Below
       [not found]       ` <87u1fr7lwn.fsf@jan.korger>
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Below @ 2003-01-29 15:35 UTC (permalink / raw)


Alain Picard <apicard+die-spammer-die@optushome.com.au> writes:

> deskpot@despammed.com (Vasily Korytov) writes:
>
>> Yep, it works here. It's very simple here: I have procmail as MDA
>> and I call spamc (spamd is run at startup) from my
>> ~/.procmailrc. Then I have ("junk.spam" "^X-Spam-Status: Yes")
>> entry in my nnmail-split-methods.
>
> I was hoping for a procmail-free solution, as this is on a laptop
> system, and I prefer to get the mail "on demand", rather than from a
> procmail daemon.

I didn't even know that procmail can be used as a daemon. Just start
fetchmail on each IP-up, and make fetchmail hand the mail over to
procmail (using procmail as an MDA). Then procmail pipes the mail
through spamassassin (you don't have to use spamd/spamc) and does some
sorting. No need for daemons.
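
On the Gnus side, the split entry quoted above could be wired in roughly
like this. A minimal sketch: the "junk.spam" entry is the one from the
quoted text, while the catch-all group "mail.misc" is a made-up name, and
it assumes SpamAssassin has already added its X-Spam-Status header before
Gnus sees the mail.

```elisp
;; File anything SpamAssassin tagged as spam into "junk.spam".
(setq nnmail-split-methods
      '(("junk.spam" "^X-Spam-Status: Yes")
        ;; Catch-all for everything else; the group name is hypothetical.
        ("mail.misc" "")))
```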

Michael
-- 
_Agricultural activity_ is the management by an enterprise of the
biological transformation of biological assets for sale, into
agricultural produce, or into additional biological assets.  IAS 41,5



* Re: Bogofilter
  2003-01-29  7:10           ` Bogofilter Kai Großjohann
@ 2003-01-29 19:59             ` Ian Soboroff
       [not found]               ` <4nlm12ehn5.fsf@lockgroove.bwh.harvard.edu>
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Soboroff @ 2003-01-29 19:59 UTC (permalink / raw)

kai.grossjohann@uni-duisburg.de (Kai Großjohann) writes:

> Ian Soboroff <org@acm.isoboroff> writes:
>
>> It's nice to see such a craze over naive-Bayes filtering techniques,
>> but they can get overtrained pretty easily.
>
> Yeah.  I don't know much about automatic classification, but I seem
> to recall that naive-Bayes is not the most effective method.
>
> So are there better algorithms around and is there an implementation
> that can be integrated into Gnus, similar to ifile?

There are boatloads of text classification algorithms.  Naive Bayes is
the canonical second best solution to any problem, and has the
advantage of being fast.  Support Vector Machines are better but NB
can get quite close on some data.  SVMs are hard to update, but to be
honest an email classifier could probably be just fine retraining
overnight.

My favorite classifier tool is Andrew McCallum's BOW toolkit.  It does
NB, SVM, kNN, EM, and probably three other things I forgot about, and
has nice support for doing measurements and experiments.  I was _this_
close to writing a scoring module for Gnus based on it, when I ran
across ifile.

The _right_ thing to do is something like nnir, that is, a classifier
framework that you can plug anything into underneath.  ifile-gnus.el
is probably most of what's needed (plus a couple more functions to
easily move mail without triggering a reclassification).
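
To make the idea concrete, here is a toy sketch of such a pluggable layer.
Every name in it is hypothetical; nothing here exists in Gnus,
ifile-gnus.el, or nnir. A backend is just a function from message text to a
group name (or nil), and the keyword matcher stands in for a real
classifier such as ifile.

```elisp
;; Hypothetical sketch; none of these names exist in Gnus.
(defvar my-classifier-backends nil
  "Alist mapping a backend symbol to a function.
Each function takes the message text as a string and returns a group
name, or nil if it has no opinion.")

(defvar my-classifier-backend 'keyword
  "Backend symbol currently used by `my-classifier-classify'.")

(defun my-classifier-register (name function)
  "Register FUNCTION as the classifier backend called NAME."
  (setq my-classifier-backends
        (cons (cons name function)
              (assq-delete-all name my-classifier-backends))))

(defun my-classifier-classify (text)
  "Classify TEXT with the backend named by `my-classifier-backend'."
  (let ((function (cdr (assq my-classifier-backend
                             my-classifier-backends))))
    (and function (funcall function text))))

;; A toy backend standing in for a real classifier.
(my-classifier-register
 'keyword
 (lambda (text)
   (and (string-match-p "viagra" text) "mail.misc.spam")))
```

Swapping classifiers then only means registering another function and
changing my-classifier-backend; the splitting code never needs to know
which one is underneath.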

Ian


* Re: spam assassin filtering
       [not found]       ` <87u1fr7lwn.fsf@jan.korger>
@ 2003-01-29 21:14         ` Vasily Korytov
  2003-01-29 21:52           ` Tim Haynes
  0 siblings, 1 reply; 10+ messages in thread
From: Vasily Korytov @ 2003-01-29 21:14 UTC (permalink / raw)


>>>>> "JK" == Jan Korger writes:

 JK> BUT if you use an MTA, i.e. deliver from fetchmail to port 25 or call
 JK> sendmail or similar, you will end up with 1 procmail + 1 spamassassin(perl)
 JK> process per message plus some MTA processes. This is very likely to
 JK> eat up all of your RAM and end in a disaster, i.e. you will lose your
 JK> mail. (This happened to me while downloading ~100 messages.)

Use spamd/spamc pair. Anyway, my old modem takes care of not losing my
mail in such situations. =))

---Vas



* Re: spam assassin filtering
  2003-01-29 21:14         ` Vasily Korytov
@ 2003-01-29 21:52           ` Tim Haynes
  0 siblings, 0 replies; 10+ messages in thread
From: Tim Haynes @ 2003-01-29 21:52 UTC (permalink / raw)

deskpot@despammed.com (Vasily Korytov) writes:

>  JK> process per message plus some MTA processes. This is very likely to
>  JK> eat up all of your RAM and end in a disaster, i.e. you will lose
>  JK> your mail. (This happened to me while downloading ~100 messages.)
>
> Use spamd/spamc pair. Anyway, my old modem takes care of not losing my
> mail in such situations. =))

Fetchmail will not flush a message off the upstream server if it gets a
failure code from the local delivery agent. 

spamc/spamd do make processing each mail considerably quicker, as the Perl
interpreter is started only once (by spamd) rather than once per message.

You can add locking to procmail recipes (the fine manpage mentions appending
a `:' to the end of the recipe's opening `:0' line, giving `:0:').

Me, I have my colo-swerver handle the initial incoming mails, bogofilter
being invoked for anything questionable; the mails that pass are copied on
to the ISP at home and pulled down with fetchmail and re-bogofiltered as
well. Seems to work :)

~Tim
-- 
But mountains are holy places,              |piglet@stirfried.vegetable.org.uk
And beauty is free / We can still walk      |http://spodzone.org.uk/
Through the garden                          |
Our earth was once green                    |


* Re: Bogofilter
       [not found]               ` <4nlm12ehn5.fsf@lockgroove.bwh.harvard.edu>
@ 2003-01-30 18:09                 ` Kai Großjohann
  2003-01-31 16:46                   ` Bogofilter Ted Zlatanov
  0 siblings, 1 reply; 10+ messages in thread
From: Kai Großjohann @ 2003-01-30 18:09 UTC (permalink / raw)


Ted Zlatanov <tzz@lifelogs.com> writes:

> On Wed, 29 Jan 2003, org@acm.isoboroff wrote:
>> The _right_ thing to do is something like nnir, that is, a
>> classifier framework that you can plug anything into underneath.
>
> I'm sort of working on that right now, it will be a generic framework
> for spam.el.

Ian is talking about text classification which could be used for
general splitting (not just the spam/ham thing that spam.el does).

This doesn't mean that spam.el is bad, just that it solves a
different problem.

I think that the tracking part could be useful for using text
classification for splitting.  But maybe it's enough to add the right
hooks to Gnus, and spam.el uses them in one way whereas the
use-a-classifier-for-splitting thing uses them in another way.
-- 
Ambibibentists unite!



* Re: Bogofilter
  2003-01-30 18:09                 ` Bogofilter Kai Großjohann
@ 2003-01-31 16:46                   ` Ted Zlatanov
  0 siblings, 0 replies; 10+ messages in thread
From: Ted Zlatanov @ 2003-01-31 16:46 UTC (permalink / raw)

On Thu, 30 Jan 2003, kai.grossjohann@uni-duisburg.de wrote:
> Ted Zlatanov <tzz@lifelogs.com> writes:
> 
>> On Wed, 29 Jan 2003, org@acm.isoboroff wrote:
>>> The _right_ thing to do is something like nnir, that is, a
>>> classifier framework that you can plug anything into underneath.
>>
>> I'm sort of working on that right now, it will be a generic
>> framework for spam.el.
> 
> Ian is talking about text classification which could be used for
> general splitting (not just the spam/ham thing that spam.el does).
> 
> This doesn't mean that spam.el is bad, just that it solves a
> different problem.

> I think that the tracking part could be useful for using text
> classification for splitting.  But maybe it's enough to add the
> right hooks to Gnus, and spam.el uses them in one way whereas the
> use-a-classifier-for-splitting thing uses them in another way.

I should have said "I'm working on a generic framework, which will be
used by spam.el".  I want to make it easy to track a message as it
exists in Gnus, and spam.el will use that but other packages can too.
The idea will be to add any data you want to a message ID, and have
that data persist as the message moves around.  spam.el will attach
things like "processed as spam by bogofilter," "moved to group ABC,"
or "split by ifile into group XYZ."
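
Since there is no code yet, the following is only a guess at the shape of
the idea, not a proposed design: a table keyed by Message-ID, so the notes
stay attached wherever the message ends up. All names are made up, and
saving the table between Emacs sessions is left out.

```elisp
;; Hypothetical sketch; these names are not part of spam.el or Gnus.
(defvar my-registry (make-hash-table :test 'equal)
  "Map a Message-ID string to a list of notes recorded about it.")

(defun my-registry-add (message-id note)
  "Record NOTE for MESSAGE-ID and return all its notes, newest first."
  (puthash message-id
           (cons note (gethash message-id my-registry))
           my-registry))

(defun my-registry-notes (message-id)
  "Return the notes recorded for MESSAGE-ID, oldest first."
  (reverse (gethash message-id my-registry)))
```

Because the key is the Message-ID rather than a group and article number,
a lookup works the same after the message has been moved or re-split.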

Right now I'm in the "thinking about it" stage, no code yet...

Ted


* Re: spam assassin filtering
       [not found] <874r7vclji.fsf@ibook.optushome.com.au>
@ 2003-01-27 15:19 ` Jay Belanger
       [not found] ` <844r7vjc3l.fsf@lucy.is.informatik.uni-duisburg.de>
       [not found] ` <87ptqiodvb.fsf@unix.home>
  2 siblings, 0 replies; 10+ messages in thread
From: Jay Belanger @ 2003-01-27 15:19 UTC (permalink / raw)



Alain Picard <apicard+die-spammer-die@optushome.com.au> writes:

> Hello all,
>
> Does anyone have a working GNUS/spam assassin setup?
> The setup described in the manual (Oort v.7) doesn't seem to
> work for me.

The method in the cvs manual is to use fancy splitting with

(setq nnmail-split-fancy '(| (: kevin-spamassassin)
                             ...))
(defun kevin-spamassassin ()
  (save-excursion
    (let ((buf (or (get-buffer " *nnmail incoming*")
                   (get-buffer " *nnml move*"))))
      (if (not buf)
          (progn (message "Oops, cannot find message buffer") nil)
        (set-buffer buf)
        (if (eq 1 (call-process-region (point-min) (point-max)
                                       "spamc" nil nil nil "-c"))
            "spam")))))

This didn't work for me either.  The output of spamc with the -c
switch is 1 if the article is spam, but the function above seems to
assume that the value of 'call-process-region' is the same as the
output of spamc. I don't think that's the case, particularly here.
If I'm right, this should probably be changed in the manual.
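
One way to test that suspicion is to look at the exit status on its own.
A minimal sketch, assuming (as I read the spamc man page) that `-c' reports
its verdict through the process exit status, 1 meaning spam, and that
'call-process-region' returns that status when it waits for the process.
The function name and the optional PROGRAM argument are made up; the
argument is only there so the check can be tried with a stand-in command.

```elisp
(defun my-spamc-spam-p (&optional program)
  "Return non-nil if PROGRAM flags the current buffer as spam.
PROGRAM defaults to \"spamc\"; it is run with \"-c\" on the buffer text."
  (let ((status (call-process-region (point-min) (point-max)
                                     (or program "spamc")
                                     nil nil nil "-c")))
    ;; STATUS is an integer exit code, or a string if a signal killed
    ;; the process; only an exit code of exactly 1 counts as spam here.
    (eq status 1)))
```

Evaluating (my-spamc-spam-p) in a buffer holding a known spam message, and
again on a known good one, should show whether the exit status carries the
verdict on a given installation.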

> Any example of a configuration known to work in real life
> would be helpful.

There are a couple of ways of doing it in the lisp files on
http://my.gnus.org/Lisp
I have it set up the following way (with help from the above function,
and the functions on my.gnus.org); it works for me.

==== from my .gnus =======
(setq nnmail-split-methods 'nnmail-split-fancy)

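(defun jpb-spamassassin ()
  ;; Work on a copy so the headers spamc adds never touch the real message.
  (with-temp-buffer
    (if (get-buffer " *nnmail incoming*")
        (insert-buffer-substring " *nnmail incoming*")
      (insert-buffer-substring gnus-original-article-buffer))
    ;; Replace the copy with spamc's output (DELETE and BUFFER are both t).
    (call-process-region (point-min) (point-max)
                         "spamc" t t nil "-f")
    (goto-char (point-min))
    ;; spamc's output carries X-Spam-Status; "Yes" means spam.
    (when (re-search-forward "^X-Spam-Status: Yes" nil t)
      "Spam")))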

(setq nnmail-split-fancy
  '(|
     <... a bunch of splits ...>
     (: jpb-spamassassin)
     "Misc"))
==========================

Jay



