From: Jonas Steverud
Newsgroups: gmane.emacs.gnus.general
Subject: Re: wallowing out of the spam quagmire
Date: Tue, 22 Jun 2004 09:52:11 +0200
Organization: The Deciples of Albericht Nibelungen
Original-To: ding@gnus.org
In-Reply-To: (Harry Putnam's message of "Mon, 21 Jun 2004 20:21:00 -0500")
User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3 (darwin)

Harry Putnam writes:

Note: I use No Gnus v0.2.

> Jonas Steverud writes:
>
> [...]
>> Both yes and no. The problem is to understand how spam.el works. It is
>> not complex, the documentation is simply not yet complete. Read it
>> before you continue with this email.
>
> I'm not sure we're from the same planetary system... or as bare
> minimum you must have a rather bizarre notion of what `not complex'
> means. I went glassy eyed after the first couple hundred lines.

As I said, the documentation is not yet finished. ;-) You only
confirmed what I said: "The problem is to understand how spam.el
works. [...] the documentation is simply not yet complete."

>>> 1) procmail/SpamAssassin based pre filtering (before gnus)
>>
>> I assume it places all spam in a specific group, lets for the
>> discussion call it nnfolder:Spam.

[...]

> So to summarize. I let procmail/sa do most splitting and culling out
> of spam. When that is done, the rest comes to my inbox and I deal
> with it by hand. I hoped to introduce bogofilter at that stage.

OK.

First: Fancy splitting is the same as ordinary splitting (which you
have already used) but allows more complex rules. If you don't want
Gnus to filter/split your mail, leave it out.

The way bogofilter works is to eat the email it is given and either

a. If it is told to train, bogofilter updates its databases of words
   that exist in spams and in hams (whether the email counts as spam
   or ham is set by command-line parameters).

b. If it is told to classify, it checks its databases and from that
   calculates the probability that the email is spam or ham. It
   reports YES or NO.

Bogofilter doesn't give a d*mn about what spam.el does.
It eats emails and either trains on them or classifies them. Period.

Spam.el does not care about which program you use for training and
classifying. It has an interface to different backends and lets them
handle that - the same approach Gnus has toward messages, nntp, pop,
imap, all messages in one file or one file for each message and so on.

What you need to do is to tell spam.el what it shall do and with which
backends.

First, some terminology: I will call the main mailbox you described
Inbox; this is where all mail ends up that procmail and sa haven't
done anything with. Some will be spam and the rest will be ham. There
are also two other groups: Spam and Ham.

So, now we are set. You can do two different things. You can move any
spam found in Inbox to Spam and train bogofilter in Spam, or you can
train bogofilter in Inbox and leave the spam there. If you use expire
in Inbox the latter is IMHO preferred, but it is all about taste.
There is no correct answer in this case. I will assume you want it
moved to Spam and bogofilter to train in Spam.

First, tell spam.el to use bogofilter. Which backend you use doesn't
matter, so if you want to use another backend later, just search and
replace with your new backend.

Add

  (setq spam-use-bogofilter t)

to your .gnus.el. Also add

  (spam-initialize)

and make sure it is the last line of all spam related code in
.gnus.el. I.e. add any further spam.el related stuff *before* this
line.

I think you need

  (setq spam-move-spam-nonspam-groups-only t)

as well.

You need to tell spam.el that any spam found in Inbox is to be moved
to Spam. Edit the group parameters of Inbox (I assume you know how to
do that, ask otherwise) to contain the following lines:

  (spam-process-destination "Spam") ;; You might need to add "nnfolder:"
                                    ;; or whatever you use as mail backend.

In case you want all ham (everything else) to be moved to Ham, add
these lines:

  (ham-process-destination "Ham") ;; Read comment above.
  (ham-marks (gnus-read-mark gnus-killed-mark)) ;; All according to taste.

Now, all spam will be moved to Spam when you exit Inbox. All mails you
consider to be spam you mark with M-d or S x (same function).

If you want spam.el to go through your Inbox folder and mark all spam
as such for you (i.e. all emails bogofilter considers to be spam), add
the following lines to the Inbox group parameters:

  (spam-autodetect-methods spam-use-bogofilter)
  (spam-autodetect . t)

The group parameters shall contain:

  Spam: (spam-contents gnus-group-spam-classification-spam)
  Ham:  (spam-contents gnus-group-spam-classification-ham)

So spam.el knows what to expect in the groups.

If I got everything right, all mails in Inbox will be checked for spam
upon entry. Any spam will be marked with $. Upon exit, all spam
(autodetected and marked by you) will be moved to Spam and all ham
(what is considered ham is decided by the ham-marks above) will be
moved to Ham. Everything that is marked as neither spam nor ham will
stay in place.

When you exit Ham and Spam, bogofilter will train on them as ham and
spam respectively. It is as important to train on ham as on spam,
since bogofilter will otherwise not know how to detect ham and will
consider everything to be spam (your email address will be present in
the spam and bogofilter will consider its presence a sure sign of
spam - if you train on ham as well it will see that your email address
is also a sure sign of ham, i.e. not a word to go by).

  (load-library "std-disclaimer") ;-)

This is all from the top of my head and I might have missed something.

> Seems like one would just invoke bogofilter on each message and send
> each one to spam or ham. Technically a split, I guess but not very
> complicated. The complicated part seems to be what goes on inside
> bogofilter. The messages it will be seeing have already skirted SA's
> complex set of interrelated rules, plus my own homeboy procmail rules
> and tweaks to SA.
> So this mail will be hard to find a pattern or some
> other thing to help indentify it.

Bogofilter keeps a statistical database of all words that occur in the
emails it has trained on, and knows whether each email was considered
(by you) ham or spam. When detecting spam, it checks the database for
each word and applies a mathematical formula. The database can look
like this (my spam database):

  FDA-Approved 1 20040610
  FDA-approved 2 20040613
  FDZTb0mAPVS 1 20040607
  FEEL 8 20040417
  FFFF00 1 20040416

Google for a description of Bayesian filters (sp?), it is quite simple
actually.

Bogofilter will detect spams that the static rules in sa have missed,
e.g. all the different spellings of Viagra: V1agra, V1ag.ra etc. The
idea is "keep a database of all good words and all bad words and check
the email and whichever has the highest ranking classifies the
message".

> My case should be the simplest possible example of using spam.el and
> bogofilter, but I'm not sure about involving gnus registry etc.
> Or what `exactly' needs doing.

The registry is a database (a lisp list actually) of all message ids
and which group they exist in. Some lines in the documentation suggest
that you need to use it for autodetection (can someone else confirm?).
In that case, add

  (setq spam-log-to-registry t)
  (gnus-registry-initialize)

HTH.

-- 
(  http://hem.bredband.net/steverud/  !  Wei Wu Wei  )
(  Meaning of U2 Lyrics, Roleplaying  !  To Do Without Do  )
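
P.S. Collected in one place, the settings mentioned in this mail could
look roughly like the sketch below. It only repeats what is written
above and is untested. Two things are assumptions rather than
something the mail confirms: that the registry lines belong before
(spam-initialize), which follows only from "spam-initialize goes
last", and that the registry is needed at all. The group parameters
are shown as comments because they are set per group, not in .gnus.el.

```elisp
;; .gnus.el -- sketch assembled from the settings named in this mail.

(setq spam-use-bogofilter t)                 ; use bogofilter as the backend
(setq spam-move-spam-nonspam-groups-only t)  ; "I think you need" this one

;; Only if autodetection turns out to need the registry (unconfirmed).
;; Placing these before spam-initialize is an assumption.
;; (setq spam-log-to-registry t)
;; (gnus-registry-initialize)

(spam-initialize)                            ; last of all spam.el related lines

;; Group parameters, edited per group (prefix group names with
;; "nnfolder:" or whatever your mail backend is, if needed):
;;
;; Inbox:
;;   (spam-process-destination "Spam")
;;   (ham-process-destination "Ham")
;;   (ham-marks (gnus-read-mark gnus-killed-mark))
;;   (spam-autodetect-methods spam-use-bogofilter)
;;   (spam-autodetect . t)
;;
;; Spam:
;;   (spam-contents gnus-group-spam-classification-spam)
;;
;; Ham:
;;   (spam-contents gnus-group-spam-classification-ham)
```

If you switch to another backend later, the spam-use-bogofilter lines
are the ones to search and replace.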