Gnus development mailing list
* An alternative to spambayes.el for those using Gnus
@ 2006-11-11 23:09 Florent Rougon
  2006-11-14 15:14 ` Ted Zlatanov
  0 siblings, 1 reply; 4+ messages in thread
From: Florent Rougon @ 2006-11-11 23:09 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

Hi,

I've been running my own interface code between Gnus and Spambayes for a
while, and improved it a bit today to the point that I think it should
be ready for public consumption.

It can do the same things as spambayes.el, but in a way that should be
cleaner and slightly faster (using `call-process-region' instead of
`shell-command-on-region', for instance). It also provides a few more
things, most notably:

  - a command for (re-)running the classifier on an article (or
    process-marked articles). Useful when you've recently trained
    Spambayes and want to see how the newly-trained filter
    performs---and maybe even respool some articles with this new
    filter.

  - a command to examine what the Spambayes filter thinks of an article
    (read-only operation): whether it is classified as ham or spam, the
    overall spam score as well as the various spam clues with their
    respective scores (from the 'X-Spambayes-Evidence' header).

This has been tested with GNU Emacs 21.4, Spambayes 1.0.3 and No Gnus
v0.6 (also with Gnus v5.10.7).

It works well for me, and I hope others will find it useful.

[-- Attachment #2: Interface between Spambayes and Gnus --]
[-- Type: application/emacs-lisp, Size: 11592 bytes --]

[-- Attachment #3: Type: text/plain, Size: 13 bytes --]


-- 
Florent



* Re: An alternative to spambayes.el for those using Gnus
  2006-11-11 23:09 An alternative to spambayes.el for those using Gnus Florent Rougon
@ 2006-11-14 15:14 ` Ted Zlatanov
       [not found]   ` <g69y7qenjqf.fsf-mIZUurteI1BWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Ted Zlatanov @ 2006-11-14 15:14 UTC (permalink / raw)
  Cc: spambayes-dev

On 11 Nov 2006, f.rougon@free.fr wrote:

> I've been running my own interface code between Gnus and Spambayes for a
> while, and improved it a bit today to the point that I think it should
> be ready for public consumption.
>
> It can do the same things as spambayes.el, but in a way that should be
> cleaner and slightly faster (using `call-process-region' instead of
> `shell-command-on-region', for instance). It also provides a few more
> things, most notably:
>
> - a command for (re-)running the classifier on an article (or
> process-marked articles). Useful when you've recently trained
> Spambayes and want to see how the newly-trained filter
> performs---and maybe even respool some articles with this new
> filter.
>
> - a command to examine what the Spambayes filter thinks of an article
> (read-only operation): whether it is classified as ham or spam, the
> overall spam score as well as the various spam clues with their
> respective scores (from the 'X-Spambayes-Evidence' header).
>
> This has been tested with GNU Emacs 21.4, Spambayes 1.0.3 and No Gnus
> v0.6 (also with Gnus v5.10.7).
>
> It works well for me, and I hope others will find it useful.

Hi,

would you consider merging your code with the Gnus spam.el system?

You need to write a backend, which includes:

- a spam/ham check function (1 function)
- spam/ham register/unregister functions (4 functions)

You would also update several variables.  It's not a lot of work.  Let
me know if you are interested.  Your spambayes.el can coexist with
spam.el (exporting functions for its use) or you can merge the code
right in.  It's up to you.

spam.el provides much of the infrastructure you mention above, especially
deciding when to run the classifier and on which articles.

Thanks
Ted




* Re: An alternative to spambayes.el for those using Gnus
       [not found]   ` <g69y7qenjqf.fsf-mIZUurteI1BWk0Htik3J/w@public.gmane.org>
@ 2006-11-19 22:50     ` Florent Rougon
       [not found]       ` <874psvrqzg.fsf-l0fJEUGJ2jSwWQYQPWwDOw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Florent Rougon @ 2006-11-19 22:50 UTC (permalink / raw)
  Cc: spambayes-dev-+ZN9ApsXKcEdnm+yROfE0A

Hi,

Ted Zlatanov <tzz-mIZUurteI1BWk0Htik3J/w@public.gmane.org> wrote:

> would you consider merging your code with the Gnus spam.el system?

Sorry for the late reply. I was a bit busy and wanted to reread the
"Spam Package Introduction" Info node to avoid giving an uninformed
answer.

Having just read it, I'm not sure the scheme implemented in spam.el fits
well with the way I want to work with Spambayes. One of the reasons is
that I do *not* want to train the filter on every article. To have an
efficient Spambayes filter, experiments made by Spambayes users and
developers have shown that it is often a good idea to only train the
filter on its mistakes (after an initial training).

[ Personally, I don't even train the filter on every mistake, because
  there are articles that I believe are too well-crafted spam: I fear
  I'll pollute my Spambayes database if I train on these articles. These
  are articles that mostly contain words that are part of my usual
  ham. ]

Therefore, I wouldn't want the "spam and ham processors" to do anything
when I exit a group. I want to carefully select which articles get to
train the filter.
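
The selective-training policy described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the `Article` record, its `veto` flag and the `select_for_training`
helper are hypothetical names, not part of Spambayes, spam.el or the
attached Emacs Lisp code, and treating an "unsure" verdict as a mistake
is a choice made for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    subject: str
    predicted: str      # filter verdict: "ham", "unsure" or "spam"
    actual: str         # what the user says it is: "ham" or "spam"
    veto: bool = False  # user chose not to train on this article


def select_for_training(articles: List[Article]) -> List[Article]:
    """Keep only misclassified articles that the user has not vetoed."""
    return [a for a in articles if a.predicted != a.actual and not a.veto]


# Hypothetical sample data, for illustration only.
inbox = [
    Article("ding digest", predicted="ham", actual="ham"),
    Article("ding patch", predicted="spam", actual="ham"),
    Article("cheap pills", predicted="unsure", actual="spam"),
    Article("well-crafted spam", predicted="ham", actual="spam", veto=True),
]
to_train = select_for_training(inbox)
print([a.subject for a in to_train])
```

Correctly classified articles and vetoed ones are left alone, so nothing
reaches the training step unless the user let it through.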

As a consequence, the paragraph in the "Spam Package Introduction" node
that reads:

,----
|    If the spam filter failed to mark a spam message, you can mark it
| yourself, so that the message is processed as spam when you exit the
| group:
| 
| `M-d'
| `M s x'
| `S x'
|      Mark current article as spam, showing it with the `$' mark
|      (`gnus-summary-mark-as-spam').
| 
| Similarly, you can unmark an article if it has been erroneously marked
| as spam.  *Note Setting Marks::.
`----

would be misleading to users, because marking articles as ham or spam
wouldn't make any difference in the absence of any action from the "spam
and ham processors".

There's another thing in spam.el that doesn't seem to work the way I
want:

,----
|    The second thing that the Spam package does when you exit a group is
| to move ham articles out of spam groups, and spam articles out of ham
| groups.  Ham in a spam group is moved to the group specified by the
| variable `gnus-ham-process-destinations', or the group parameter
| `ham-process-destination'.  Spam in a ham group is moved to the group
| specified by the variable `gnus-spam-process-destinations', or the
| group parameter `spam-process-destination'.
`----

This means that if, e.g., I had a ham that was classified as spam and I
mark it as ham before leaving the group, then the article will be moved
to the group specified by `gnus-ham-process-destinations'---regardless
of the specific article.

I prefer my way of doing that: if an article is misclassified, there are
two possibilities:
  - either I don't want to train the filter on the article (for
    instance, because several similar articles were misclassified in a
    row and I already trained the filter on one of them). In this case,
    I usually simply use 'B m' to move the article manually to the right
    group.

    There is another possibility that works well in the example I gave in
    the parentheses: since the filter was trained on a similar article,
    you can expect it to classify the article correctly next time;
    therefore, you can call '(flo-spambayes-gnus-classify t)' in order
    to:

       1. rerun the classifier on the article;
       2. respool it afterwards (this is because of the "t" argument).

    The respooled article will eventually end up in the right group
    according to `nnmail-split-methods'.

  - or I use 'B s' (resp. 'B h') to tell the filter "Dude, this was
    spam!" (resp. "Dude, this was ham!"), i.e., I train the filter on
    the article. These key sequences, which are mapped to lambda
    expressions evaluating '(flo-spambayes-gnus-refile-as-spam t)' and
    '(flo-spambayes-gnus-refile-as-ham t)' respectively, do two things:

       1. train the filter on the article;
       2. respool it afterwards (this is because of the "t" argument).

    As a consequence, the article will (most probably) end up in the
    right group, according to `nnmail-split-methods'.

    [ I say "most probably", because it might be that the filter was so
      badly trained in the past that it still couldn't classify the
      article correctly the second time. This never happened to me, but
      I think it's possible. ]

The key point here is that in either case, if the article was, e.g.,
something for the ding mailing-list wrongly classified as spam when the
incoming mail was split, it will end up directly in my "ding" group
after the corrective actions I described, not in whichever group is
specified by `gnus-ham-process-destinations'.

Lastly, there's another thing I'm not sure about when reading the Info
node:

,----
| The Spam package divides Gnus groups into three categories: ham
| groups, spam groups, and unclassified groups.
`----

What exactly do unclassified groups contain? With Spambayes, when you
run an article through the classifier, it gets a spam score (between 0
and 1) and a category depending on the spam score. There are three
categories: ham, unsure and spam (from lowest score to highest score).
"unsure" means the article got a score that is not low enough to be
confident it's ham, and not high enough to be confident it's spam. But
it surely doesn't mean the article wasn't _classified_ (i.e., it did go
through the classifier---whose output was "unsure"). That's why I'm not
sure the "unclassified group" mentioned in the above sentence is
well-suited for articles marked as "unsure" by Spambayes.

To rephrase it differently: you said a spam backend must provide a
function that tells whether a message is ham or spam. But this is not
suited to Spambayes, since there are 3 possible outcomes from the filter
by default, not 2 (unless you tweak it to make the "unsure" score range
vanish, but that would be silly in most cases).
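
The three-way outcome described above can be sketched in Python as a
score compared against two cutoffs. This is a hedged illustration, not
Spambayes code: the `categorize` function is a hypothetical name, the
cutoff values 0.20 and 0.90 are assumptions chosen for the example, and
the handling of scores that fall exactly on a cutoff is also an
assumption.

```python
# Assumed cutoffs, for illustration only.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def categorize(score: float,
               ham_cutoff: float = HAM_CUTOFF,
               spam_cutoff: float = SPAM_CUTOFF) -> str:
    """Map a spam score in [0, 1] to "ham", "unsure" or "spam"."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"      # low enough to be confident it is ham
    if score < spam_cutoff:
        return "unsure"   # classified, but in neither confident range
    return "spam"         # high enough to be confident it is spam


print(categorize(0.05), categorize(0.50), categorize(0.97))
```

Setting both cutoffs to the same value makes the "unsure" range vanish,
which is the tweak mentioned above: `categorize(0.5, 0.5, 0.5)` can then
only be ham or spam.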

Regards,

-- 
Florent


* Re: An alternative to spambayes.el for those using Gnus
       [not found]       ` <874psvrqzg.fsf-l0fJEUGJ2jSwWQYQPWwDOw@public.gmane.org>
@ 2006-11-21 18:50         ` Ted Zlatanov
  0 siblings, 0 replies; 4+ messages in thread
From: Ted Zlatanov @ 2006-11-21 18:50 UTC (permalink / raw)
  Cc: spambayes-dev-+ZN9ApsXKcEdnm+yROfE0A

On 19 Nov 2006, f.rougon-GANU6spQydw@public.gmane.org wrote:

> Having just read it, I'm not sure the scheme implemented in spam.el fits
> well with the way I want to work with Spambayes. One of the reasons is
> that I do *not* want to train the filter on every article. To have an
> efficient Spambayes filter, experiments made by Spambayes users and
> developers have shown that it is often a good idea to only train the
> filter on its mistakes (after an initial training).

OK.  This *can* be the usage mode, but basically we leave it up to the
user, and it's a global choice.  Read on...

> [ Personally, I don't even train the filter on every mistake, because
> there are articles that I believe are too well-crafted spam: I fear
> I'll pollute my Spambayes database if I train on these articles. These
> are articles that mostly contain words that are part of my usual
> ham. ]
>
> Therefore, I wouldn't want the "spam and ham processors" to do anything
> when I exit a group. I want to carefully select which articles get to
> train the filter.

OK, then you don't want spam or ham groups, which are the only groups
where automatic action is taken.  Unclassified groups have the
behavior that only explicitly marked (by you) spam is processed by a
backend.

> As a consequence, the paragraph in the "Spam Package Introduction" node
> that reads:
>
> ,----
> | If the spam filter failed to mark a spam message, you can mark it
> | yourself, so that the message is processed as spam when you exit the
> | group:
> |
> | `M-d'
> | `M s x'
> | `S x'
> |      Mark current article as spam, showing it with the `$' mark
> |      (`gnus-summary-mark-as-spam').
> |
> | Similarly, you can unmark an article if it has been erroneously marked
> | as spam.  *Note Setting Marks::.
> `----
>
> would be misleading to users, because marking articles as ham or spam
> wouldn't make any difference in the absence of any action from the "spam
> and ham processors".

I'm not sure what you mean.  In any group, whatever articles are
marked as spam on exit are processed as spam by the group's spam
backends.  Spam groups have some extra behavior here.  If the group is
unclassified (neither a ham nor a spam group) then no automatic spam
marking will be done, but the processing is always done.

> There's another thing in spam.el that doesn't seem to work the way I
> want:
>
> ,----
> | The second thing that the Spam package does when you exit a group is
> | to move ham articles out of spam groups, and spam articles out of ham
> | groups.  Ham in a spam group is moved to the group specified by the
> | variable `gnus-ham-process-destinations', or the group parameter
> | `ham-process-destination'.  Spam in a ham group is moved to the group
> | specified by the variable `gnus-spam-process-destinations', or the
> | group parameter `spam-process-destination'.
> `----
>
> This means that if, e.g., I had a ham that was classified as spam and I
> mark it as ham before leaving the group, then the article will be moved
> to the group specified by `gnus-ham-process-destinations'---regardless
> of the specific article.
>
> I prefer my way of doing that: if an article is misclassified, there are
> two possibilities:
> - either I don't want to train the filter on the article (for
> instance, because several similar articles were misclassified in a
> row and I already trained the filter on one of them). In this case,
> I usually simply use 'B m' to move the article manually to the right
> group.

OK.  This doesn't interfere with the spam.el processing.

> There is another possibility that works well in the example I gave in
> the parentheses: since the filter was trained on a similar article,
> you can expect it to classify the article correctly next time;
> therefore, you can call '(flo-spambayes-gnus-classify t)' in order
> to:
>
> 1. rerun the classifier on the article;
> 2. respool it afterwards (this is because of the "t" argument).
>
> The respooled article will eventually end up in the right group
> according to `nnmail-split-methods'.

We have a 'respool spam or ham destination which will do the
respooling you describe.  You can use it in addition to any spam
backends for that group.

> - or I use 'B s' (resp. 'B h') to tell the filter "Dude, this was
> spam!" (resp. "Dude, this was ham!"), i.e., I train the filter on
> the article. These key sequences, which are mapped to lambda
> expressions evaluating '(flo-spambayes-gnus-refile-as-spam t)' and
> '(flo-spambayes-gnus-refile-as-ham t)' respectively, do two things:
>
> 1. train the filter on the article;
> 2. respool it afterwards (this is because of the "t" argument).
>
> As a consequence, the article will (most probably) end up in the
> right group, according to `nnmail-split-methods'.
>
> [ I say "most probably", because it might be that the filter was so
> badly trained in the past that it still couldn't classify the
> article correctly the second time. This never happened to me, but
> I think it's possible. ]
>
> The key point here is that in either case, if the article was, e.g.,
> something for the ding mailing-list wrongly classified as spam when the
> incoming mail was split, it will end up directly in my "ding" group
> after the corrective actions I described, not in whichever group is
> specified by `gnus-ham-process-destinations'.

I think you want immediate spam/ham processing and to see what
happened right away.  spam.el doesn't do that because it's very slow
for some filters, deferring the action to the time you exit the group
instead (batching all backend processing).  I think it could be done
for individual backends, or per group, though.

> Lastly, there's another thing I'm not sure about when reading the Info
> node:
>
> ,----
> | The Spam package divides Gnus groups into three categories: ham
> | groups, spam groups, and unclassified groups.
> `----
>
> What exactly do unclassified groups contain? With Spambayes, when you
> run an article through the classifier, it gets a spam score (between 0
> and 1) and a category depending on the spam score. There are three
> categories: ham, unsure and spam (from lowest score to highest score).
> "unsure" means the article got a score that is not low enough to be
> confident it's ham, and not high enough to be confident it's spam. But
> it surely doesn't mean the article wasn't _classified_ (i.e., it did go
> through the classifier---whose output was "unsure"). That's why I'm not
> sure the "unclassified group" mentioned in the above sentence is
> well-suited for articles marked as "unsure" by Spambayes.

Spam groups: all unread messages are marked as spam when you enter.
Unclassified groups: no extra marking is done.
Ham groups: no extra marking is done.

All other differences are for summary exit processing.  So the type of
group has to do with marking and processing, and most of the work is
aimed at making sure that spam ends up in spam groups and is processed
by a spam backend, and that ham ends up outside spam groups and is
processed by ham backends.

> To rephrase it differently: you said a spam backend must provide a
> function that tells whether a message is ham or spam. But this is not
> suited to Spambayes, since there are 3 possible outcomes from the filter
> by default, not 2 (unless you tweak it to make the "unsure" score range
> vanish, but that would be silly in most cases).

Actually you can also return nil, which means "unsure" :)  In the
context of nnmail-split-methods that means "go to the next method."
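
The fall-through behavior can be modeled in Python with `None` standing
in for nil. This is a rough sketch of the idea only, not how spam.el or
`nnmail-split-methods' are implemented: `split`, `spam_check`,
`ding_check`, the article dictionaries and the group names are all
hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Article = Dict[str, str]
SplitMethod = Callable[[Article], Optional[str]]


def spam_check(article: Article) -> Optional[str]:
    """Return the spam group for spam; None means "no decision"."""
    return "spam" if article.get("category") == "spam" else None


def ding_check(article: Article) -> Optional[str]:
    """Return the list group for mailing-list traffic, else None."""
    return "ding" if "ding" in article.get("to", "") else None


def split(article: Article,
          methods: Sequence[SplitMethod],
          default: str = "inbox") -> str:
    """Try each method in order; the first non-None answer wins."""
    for method in methods:
        group = method(article)
        if group is not None:
            return group
    return default  # every method declined to decide


methods = [spam_check, ding_check]
print(split({"category": "unsure", "to": "ding@example.org"}, methods))
```

Under these assumptions an "unsure" article is not forced into a ham or
spam decision; it simply passes to the next method in the list.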

spam.el tries to be very flexible, and the rules are aimed at making
the user's life easier.  If you think the docs or the workflow are
confusing, I'll be glad to take any suggestions you have.

Ted
