From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/57937
Path: main.gmane.org!not-for-mail
From: Jonas Steverud <tvrud@bredband.net>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: wallowing out of the spam quagmire
Date: Tue, 22 Jun 2004 09:52:11 +0200
Organization: The Deciples of Albericht Nibelungen
Sender: ding-owner@lists.math.uh.edu
Message-ID: <m24qp4rpw4.fsf@c-9a5372d5.036-4-67626721.cust.bredbandsbolaget.se>
References: <m31xkbz9lg.fsf@newsguy.com>
	<m2k6y2u35s.fsf@c-9a5372d5.036-4-67626721.cust.bredbandsbolaget.se>
	<m3d63s9ymb.fsf@newsguy.com>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1087890777 19013 80.91.224.253 (22 Jun 2004 07:52:57 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 22 Jun 2004 07:52:57 +0000 (UTC)
Original-X-From: ding-owner+M6478@lists.math.uh.edu Tue Jun 22 09:52:44 2004
Return-path: <ding-owner+M6478@lists.math.uh.edu>
Original-Received: from malifon.math.uh.edu ([129.7.128.13])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1Bcg5D-0003Qf-00
	for <ding-account@gmane.org>; Tue, 22 Jun 2004 09:52:43 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.math.uh.edu)
	by malifon.math.uh.edu with smtp (Exim 3.20 #1)
	id 1Bcg4L-0001ZJ-00; Tue, 22 Jun 2004 02:51:49 -0500
Original-Received: from util2.math.uh.edu ([129.7.128.23])
	by malifon.math.uh.edu with esmtp (Exim 3.20 #1)
	id 1Bcg4D-0001ZD-00
	for ding@lists.math.uh.edu; Tue, 22 Jun 2004 02:51:41 -0500
Original-Received: from justine.libertine.org ([66.139.78.221] ident=postfix)
	by util2.math.uh.edu with esmtp (Exim 4.30)
	id 1Bcg4C-000735-JV
	for ding@lists.math.uh.edu; Tue, 22 Jun 2004 02:51:40 -0500
Original-Received: from mxfep02.bredband.com (mxfep02.bredband.com [195.54.107.73])
	by justine.libertine.org (Postfix) with ESMTP id C136F3A01FB
	for <ding@gnus.org>; Tue, 22 Jun 2004 02:51:35 -0500 (CDT)
Original-Received: from c-9a5372d5.036-4-67626721.cust.bredbandsbolaget.se.bredband.net ([213.114.83.231] [213.114.83.231])
          by mxfep02.bredband.com with ESMTP
          id <20040622075130.KRUC26240.mxfep02.bredband.com@c-9a5372d5.036-4-67626721.cust.bredbandsbolaget.se.bredband.net>
          for <ding@gnus.org>; Tue, 22 Jun 2004 09:51:30 +0200
Original-To: ding@gnus.org
Mail-Copies-To: never
In-Reply-To: <m3d63s9ymb.fsf@newsguy.com> (Harry Putnam's message of "Mon,
 21 Jun 2004 20:21:00 -0500")
User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3 (darwin)
Precedence: bulk
Xref: main.gmane.org gmane.emacs.gnus.general:57937
X-Report-Spam: http://spam.gmane.org/gmane.emacs.gnus.general:57937

Harry Putnam <reader@newsguy.com> writes:

Note: I use No Gnus v0.2.

> Jonas Steverud <tvrud@bredband.net> writes:
>
[...]
>> Both yes and no. The problem is to understand how spam.el works. It is
>> not complex, the documentation is simply not yet complete. Read it
>> before you continue with this email.
>
> I'm not sure we're from the same planetary system... or as bare
> minimum you must have a rather bizarre notion of what `not complex'
> means.  I went glassy eyed after the first couple hundred lines.

As I said, the documentation is not yet finished. ;-) You only
confirmed what I said: "The problem is to understand how spam.el
works. [...] the documentation is simply not yet complete."

>>>    1) procmail/SpamAssassin based pre filtering (before gnus)
>>
>> I assume it places all spam in a specific group, lets for the
>> discussion call it nnfolder:Spam.
[...]
> So to summarize.  I let procmail/sa do most splitting and culling out
> of spam.  When that is done, the rest comes to my inbox and I deal
> with it by hand.  I hoped to introduce bogofilter at that stage.

OK.

First: fancy splitting is the same as the splitting you already use,
but it allows more complex rules. If you don't want Gnus to
filter/split your mail, leave it out.

Bogofilter works by eating the email it is given and then doing one
of two things:

a. If it is told to train, bogofilter updates its databases of words
that occur in spam and in ham (command line parameters say whether the
email counts as spam or ham).

b. If it is told to classify, it checks its databases and from them
calculates the probability that the email is spam or ham. It reports
YES or NO.

Bogofilter doesn't give a d*mn about what spam.el does. It eats emails
and either trains on them or classifies them. Period.
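
To make the two modes concrete, here is a rough sketch of calling
bogofilter by hand from Emacs Lisp. It is not what spam.el does
internally. The function names are made up for illustration, and I am
assuming the flags and exit codes from the bogofilter man page (-s
registers spam, -n registers ham; exit status 0 = spam, 1 = ham,
2 = unsure), so check them against your installed version.

```elisp
;; Hypothetical helpers, not part of spam.el or Gnus.
;; Both feed the current buffer (one raw message) to bogofilter on stdin.

(defun my-bogofilter-classify-buffer ()
  "Classify the message in the current buffer.
Return `spam', `ham', `unsure' or `error' from bogofilter's exit status."
  (let ((status (call-process-region (point-min) (point-max)
                                     "bogofilter" nil nil nil)))
    (cond ((eql status 0) 'spam)     ; assumed: 0 means spam
          ((eql status 1) 'ham)      ; assumed: 1 means non-spam
          ((eql status 2) 'unsure)   ; assumed: 2 means unsure
          (t 'error))))

(defun my-bogofilter-train-buffer (as-spam)
  "Train bogofilter on the current buffer, as spam if AS-SPAM is non-nil."
  (call-process-region (point-min) (point-max)
                       "bogofilter" nil nil nil
                       (if as-spam "-s" "-n")))
```

Both functions signal an error if no bogofilter binary is on your
path.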

Spam.el does not care about which program you use for training and
classifying. It has an interface to different backends and lets them
handle that - the same approach Gnus has toward messages, nntp, pop,
imap, all messages in one file or one file for each message and so on.

What you need to do is tell spam.el what to do and with which
backends.

First, some terminology:

I will call the main mailbox you described Inbox; this is where all
mail ends up that procmail and SA haven't done anything with. Some of
it will be spam and the rest will be ham. There are also two other
groups: Spam and Ham.


So, now we are set.

You can do two different things. You can move any found spam in Inbox
to Spam and train bogofilter in Spam or you can train bogofilter in
Inbox and leave the spam there. If you use expire in Inbox the latter
is IMHO preferred, but it is all about taste. There is no correct
answer in this case. I will assume you want it moved to Spam and
bogofilter to train in Spam.

First, tell spam.el to use bogofilter. Which backend you use doesn't
matter so if you want to use another backend later, just search and
replace with your new backend.

Add (setq spam-use-bogofilter t) to your .gnus.el.
Also add (spam-initialize)  and make sure it is the last line of all
spam related code in .gnus.el. I.e. add any further spam.el related
stuff *before* this line.

I think you need (setq spam-move-spam-nonspam-groups-only t) as well.
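
Collected in one place, the .gnus.el part might look like the sketch
below. The (require 'spam) line is my assumption and is not mentioned
above; it should be harmless if spam.el is loaded some other way.

```elisp
;; Sketch of the spam-related part of ~/.gnus.el.
(require 'spam)                              ; assumption, see above

(setq spam-use-bogofilter t)                 ; pick bogofilter as backend
(setq spam-move-spam-nonspam-groups-only t)  ; possibly needed, not verified

;; Keep this last; all other spam.el settings go before it.
(spam-initialize)
```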

You need to tell spam.el that any spam found in Inbox is to be moved
to Spam. Edit the group parameters of Inbox (I assume you know how to
do that, ask otherwise) to contain the following lines:

 (spam-process-destination "Spam") ;; You might need to add "nnfolder:"
                                   ;; or whatever you use as mail backend.

In case you want all ham (everything else) to be moved to Ham, add
these lines:
 (ham-process-destination "Ham") ;; Read comment above.
 (ham-marks
   (gnus-read-mark gnus-killed-mark)) ;; All according to taste.

Now all spam will be moved to Spam when you exit Inbox. Mark every mail
you consider to be spam with M-d or S x (same function). If you want
spam.el to go through your Inbox and mark all spam for you (i.e. all
emails bogofilter considers spam), add the following lines to the
Inbox group parameters:

 (spam-autodetect-methods spam-use-bogofilter)
 (spam-autodetect . t)
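
Put together, the complete Inbox group parameters could look roughly
like this in the parameter edit buffer (G p on the group). The
"nnfolder:" prefixes are an assumption about your mail backend;
adjust the group names to whatever you actually use.

```elisp
;; Group parameters for Inbox. This is data for the parameter editor,
;; not a form to evaluate.
((spam-process-destination "nnfolder:Spam")   ; assumed group name
 (ham-process-destination "nnfolder:Ham")     ; assumed group name
 (ham-marks
  (gnus-read-mark gnus-killed-mark))          ; all according to taste
 (spam-autodetect-methods spam-use-bogofilter)
 (spam-autodetect . t))
```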

The group parameters of Spam and Ham should contain:
 Spam: (spam-contents gnus-group-spam-classification-spam)
 Ham:  (spam-contents gnus-group-spam-classification-ham)
so that spam.el knows what to expect in the groups.


If I got everything right, all mails in Inbox will be checked for spam
upon entry. Any spam will be marked with $. Upon exit, all spam
(autodetected or marked by you) will be moved to Spam and all ham
(what counts as ham is decided by ham-marks above) will be moved to
Ham. Everything that is marked as neither spam nor ham will stay in
place.

When you exit Ham and Spam, bogofilter will train on them as ham and
spam respectively.
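
One caveat: going by my reading of the Gnus manual, training on exit
also wants a spam-process parameter in each group, which is not among
the parameters listed above. Treat the sketch below as an assumption
to verify against the documentation for your Gnus version; older
versions may expect a different form of the parameter.

```elisp
;; Group parameters for Spam (data for the parameter editor).
((spam-contents gnus-group-spam-classification-spam)
 (spam-process ((spam spam-use-bogofilter))))  ; assumed, from the manual

;; Group parameters for Ham.
((spam-contents gnus-group-spam-classification-ham)
 (spam-process ((ham spam-use-bogofilter))))   ; assumed, from the manual
```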

It is as important to train on ham as on spam, since otherwise
bogofilter will not know how to detect ham and will consider
everything spam. Your email address, for example, will be present in
the spam, and bogofilter will take its presence as a sure sign of
spam. If you train on ham as well, it will see that your address is
just as much a sign of ham, i.e. not a word to go by.

(load-library "std-disclaimer") ;-) This is all off the top of my
head and I might have missed something.

> Seems like one would just invoke bogofilter on each message and send
> each one to spam or ham.  Technically a split, I guess but not very
> complicated. The complicated part seems to be what goes on inside
> bogofilter.  The messages it will be seeing have already skirted SA's
> complex set of interrelated rules, plus my own homeboy procmail rules
> and tweaks to SA.  So this mail will be hard to find a pattern or some
> other thing to help indentify it.

Bogofilter keeps a statistical database of all words that occur in the
emails and knows whether each email was considered (by you) ham or
spam. When detecting spam, it looks up each word in the database and
applies a mathematical formula.

The database can look like this (my spam database):
FDA-Approved 1 20040610
FDA-approved 2 20040613
FDZTb0mAPVS 1 20040607
FEEL 8 20040417
FFFF00 1 20040416

Google for a description of Bayesian filters (sp?); it is actually
quite simple. Bogofilter will detect spam that the static rules in SA
have missed, e.g. all the different spellings of Viagra: V1agra,
V1ag.ra etc.

The idea is "keep a database of all good words and all bad words and
check the email and whichever has the highest ranking classifies the
message".
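
The idea fits in a few lines of Emacs Lisp. This is a toy with a
made-up scoring rule (a sum of smoothed log ratios), not bogofilter's
actual formula, and all the names are hypothetical. It only shows why
a word seen equally often in spam and ham ends up neutral.

```elisp
;; Toy word-count classifier; NOT bogofilter's real algorithm.
(defvar my-toy-spam-counts (make-hash-table :test 'equal)
  "Hypothetical table: word -> number of spam mails containing it.")
(defvar my-toy-ham-counts (make-hash-table :test 'equal)
  "Hypothetical table: word -> number of ham mails containing it.")

(defun my-toy-words (text)
  "Return the distinct lowercase words of TEXT."
  (delete-dups (split-string (downcase text) "[^[:alnum:]]+" t)))

(defun my-toy-train (text table)
  "Count each distinct word of TEXT once in TABLE."
  (dolist (word (my-toy-words text))
    (puthash word (1+ (gethash word table 0)) table)))

(defun my-toy-spam-p (text)
  "Return non-nil if TEXT's words lean toward the spam table."
  (let ((score 0.0))
    (dolist (word (my-toy-words text))
      (let ((s (gethash word my-toy-spam-counts 0))
            (h (gethash word my-toy-ham-counts 0)))
        ;; Add-one smoothing: unseen words, and words seen equally
        ;; often in both tables, contribute log(1) = 0.
        (setq score (+ score (log (/ (+ s 1.0) (+ h 1.0)))))))
    (> score 0)))
```

After (my-toy-train "cheap viagra offer" my-toy-spam-counts) and
(my-toy-train "gnus meeting notes" my-toy-ham-counts), the call
(my-toy-spam-p "viagra offer") returns non-nil.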

> My case should be the simplest possible example of using spam.el and
> bogofilter, but I'm not sure about involving gnus registry etc.
> Or what `exactly' needs doing.

The registry is a database (a lisp list actually) of all message ids
and which group they exist in. Some lines in the documentation suggest
that you need to use it for autodetection (can someone else
confirm?). In that case, add
(setq spam-log-to-registry t)
(gnus-registry-initialize)

HTH.

-- 
(        http://hem.bredband.net/steverud/        !     Wei Wu Wei     )
(        Meaning of U2 Lyrics, Roleplaying        !  To Do Without Do  )