Gnus development mailing list
From: Jonas Steverud <tvrud@bredband.net>
Subject: Re: wallowing out of the spam quagmire
Date: Tue, 22 Jun 2004 09:52:11 +0200
Message-ID: <m24qp4rpw4.fsf@c-9a5372d5.036-4-67626721.cust.bredbandsbolaget.se>
In-Reply-To: <m3d63s9ymb.fsf@newsguy.com> (Harry Putnam's message of "Mon, 21 Jun 2004 20:21:00 -0500")

Harry Putnam <reader@newsguy.com> writes:

Note: I use No Gnus v0.2.

> Jonas Steverud <tvrud@bredband.net> writes:
>
[...]
>> Both yes and no. The problem is to understand how spam.el works. It is
>> not complex, the documentation is simply not yet complete. Read it
>> before you continue with this email.
>
> I'm not sure we're from the same planetary system... or as bare
> minimum you must have a rather bizarre notion of what `not complex'
> means.  I went glassy eyed after the first couple hundred lines.

As I said, the documentation is not yet finished. ;-) You only
confirmed what I said: "The problem is to understand how spam.el
works. [...] the documentation is simply not yet complete."

>>>    1) procmail/SpamAssassin based pre filtering (before gnus)
>>
>> I assume it places all spam in a specific group, lets for the
>> discussion call it nnfolder:Spam.
[...]
> So to summarize.  I let procmail/sa do most splitting and culling out
> of spam.  When that is done, the rest comes to my inbox and I deal
> with it by hand.  I hoped to introduce bogofilter at that stage.

OK.

First: The fancy splitting is the same as splitting (which you have
already used) but allows more complex rules. If you don't want Gnus
to filter/split your mail, leave it out.

Bogofilter eats the email it is given and does one of two things:

a. If it is told to train, bogofilter updates its databases of words
that occur in spam and in ham (whether the email counts as spam or
ham is set by command-line parameters).

b. If it is told to classify, it checks its databases and from them
calculates the probability that the email is spam or ham. It reports
YES or NO.

Bogofilter doesn't give a d*mn about what spam.el does. It eats emails
and either trains on them or classifies them. Period.

Spam.el does not care about which program you use for training and
classifying. It has an interface to different backends and lets them
handle that - the same approach Gnus has toward messages, nntp, pop,
imap, all messages in one file or one file for each message and so on.

What you need to do is tell spam.el what to do and with which
backends.

First, some terminology:

I will call the main mailbox you described Inbox; this is where all
mail ends up that procmail and SA haven't acted on. Some of it will be
spam and the rest will be ham. There are also two other groups: Spam
and Ham.


So, now we are set.

You can do two different things. You can move any found spam in Inbox
to Spam and train bogofilter in Spam or you can train bogofilter in
Inbox and leave the spam there. If you use expire in Inbox the latter
is IMHO preferred, but it is all about taste. There is no correct
answer in this case. I will assume you want it moved to Spam and
bogofilter to train in Spam.

First, tell spam.el to use bogofilter. Which backend you use doesn't
matter so if you want to use another backend later, just search and
replace with your new backend.

Add (setq spam-use-bogofilter t) to your .gnus.el.
Also add (spam-initialize)  and make sure it is the last line of all
spam related code in .gnus.el. I.e. add any further spam.el related
stuff *before* this line.

I think you need (setq spam-move-spam-nonspam-groups-only t) as well.
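
Put together, the spam-related part of .gnus.el might look like the
sketch below. It only uses the settings named above; the
(require 'spam) line is an assumption and may be unnecessary depending
on how your Gnus is loaded.

```elisp
;; Sketch of the spam.el section of ~/.gnus.el.
;; Assumes bogofilter is installed and can be found on your PATH.
(require 'spam)                              ; assumption: harmless if already loaded

(setq spam-use-bogofilter t)                 ; use bogofilter as the backend
(setq spam-move-spam-nonspam-groups-only t)  ; possibly needed, see above

;; Keep this as the last spam-related line in the file.
(spam-initialize)
```

To switch backends later, replace spam-use-bogofilter with the
corresponding spam-use-... variable of the new backend.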

You need to tell spam.el that any spam found in Inbox is to be moved
to Spam. Edit the group parameters of Inbox (I assume you know how to
do that, ask otherwise) to contain the following lines:

 (spam-process-destination "Spam") ;; You might need to add "nnfolder:"
                                   ;; or whatever you use as mail backend.

In case you want all ham (everything else) to be moved to Ham, add
these lines:
 (ham-process-destination "Ham") ;; Read comment above.
 (ham-marks
   (gnus-read-mark gnus-killed-mark)) ;; All according to taste.

Now, all spam will be moved to Spam when you exit Inbox. All mails you
consider to be spam you mark with M-d or S x (same function). If you
want spam.el to go through your Inbox group and mark all spam as such
for you (i.e. all emails bogofilter considers to be spam), add the
following lines to the Inbox group parameters:

 (spam-autodetect-methods spam-use-bogofilter)
 (spam-autodetect . t)

The group parameters of Spam and Ham should contain:
 Spam: (spam-contents gnus-group-spam-classification-spam)
 Ham:  (spam-contents gnus-group-spam-classification-ham)
so that spam.el knows what to expect in the groups.
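
Collected in one place, the group parameters described so far could
look like the sketch below. The group names are the placeholders used
in this mail and "nnfolder:" is an assumed backend prefix; the ham
destination and ham-marks entries are optional. These are parameter
lists to paste into the group parameter editor (G p in the group
buffer), not forms to evaluate.

```elisp
;; Inbox group parameters:
((spam-process-destination "nnfolder:Spam")    ; assumed nnfolder backend
 (ham-process-destination "nnfolder:Ham")      ; optional
 (ham-marks
  (gnus-read-mark gnus-killed-mark))           ; optional, according to taste
 (spam-autodetect-methods spam-use-bogofilter)
 (spam-autodetect . t))

;; Spam group parameters:
((spam-contents gnus-group-spam-classification-spam))

;; Ham group parameters:
((spam-contents gnus-group-spam-classification-ham))
```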


If I got everything right, all mails in Inbox will be checked for spam
upon entry. Any spam will be marked with $. Upon exit, all spam
(autodetected and marked by you) will be moved to Spam and all ham
(what counts as ham is decided by the ham-marks above) will be
moved to Ham. Everything that is marked as neither spam nor ham will
stay in place.

When you exit Ham and Spam, bogofilter will train on them as ham and
spam respectively.
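
A caveat, stated as an assumption rather than a tested fact: training
on exit may only happen if the groups also name an exit processor in
their spam-process parameter. The exact form depends on the Gnus
version, so check the spam.el documentation that ships with yours.
Older releases use symbols such as
gnus-group-spam-exit-processor-bogofilter; newer ones use the
(spam spam-use-bogofilter) form sketched here.

```elisp
;; Spam group parameters (assumed newer spam-process syntax):
((spam-contents gnus-group-spam-classification-spam)
 (spam-process ((spam spam-use-bogofilter))))   ; train as spam on exit

;; Ham group parameters:
((spam-contents gnus-group-spam-classification-ham)
 (spam-process ((ham spam-use-bogofilter))))    ; train as ham on exit
```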

It is as important to train on ham as on spam, since bogofilter will
otherwise not know how to detect ham and will consider everything
spam (your email address will be present in the spam and bogofilter
will consider its presence a sure sign of spam - if you train on
ham as well, it will see that your address is also a sure sign of ham,
i.e. not a word to go by).

(load-library "std-disclaimer") ;-) This is all from the top of my
head and I might have missed something.

> Seems like one would just invoke bogofilter on each message and send
> each one to spam or ham.  Technically a split, I guess but not very
> complicated. The complicated part seems to be what goes on inside
> bogofilter.  The messages it will be seeing have already skirted SA's
> complex set of interrelated rules, plus my own homeboy procmail rules
> and tweaks to SA.  So this mail will be hard to find a pattern or some
> other thing to help indentify it.

Bogofilter keeps a statistical database of all words that occur in the
emails it has trained on and knows whether each email was considered
(by you) ham or spam. When detecting spam, it looks up each word in
the database and applies a mathematical formula.

The database can look like this (my spam database):
FDA-Approved 1 20040610
FDA-approved 2 20040613
FDZTb0mAPVS 1 20040607
FEEL 8 20040417
FFFF00 1 20040416

Google for a description of Bayesian filters; it is quite simple
actually. Bogofilter will detect spam that the static rules in SA
have missed, e.g. all the different spellings of Viagra: V1agra,
V1ag.ra etc.

The idea is "keep a database of all good words and all bad words and
check the email and whichever has the highest ranking classifies the
message".
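
As a rough worked form of that idea, one simple Bayesian scheme (in
the style of Paul Graham's "A Plan for Spam", not necessarily the
exact formula bogofilter applies) scores each word and then combines
the scores. The symbols are illustrative: s_w and h_w are the counts
of word w in the spam and ham databases, N_s and N_h the number of
spam and ham messages trained on.

```latex
p_w = \frac{s_w / N_s}{s_w / N_s + h_w / N_h},
\qquad
P(\text{spam}) = \frac{\prod_w p_w}{\prod_w p_w + \prod_w (1 - p_w)}
```

A word seen only in spam gets p_w = 1 (real filters clamp this slightly
below 1), while a word that is equally common in both gets 0.5 and
does not move the result either way. That is why a word that shows up
in both spam and ham is "not a word to go by".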

> My case should be the simplest possible example of using spam.el and
> bogofilter, but I'm not sure about involving gnus registry etc.
> Or what `exactly' needs doing.

The registry is a database (a lisp list actually) of all message ids
and which group they exist in. Some lines in the documentation suggest
that you need to use it for autodetection (can someone else
confirm?). In that case, add
(setq spam-log-to-registry t)
(gnus-registry-initialize)

HTH.

-- 
(        http://hem.bredband.net/steverud/        !     Wei Wu Wei     )
(        Meaning of U2 Lyrics, Roleplaying        !  To Do Without Do  )



