Gnus development mailing list
From: jhbrown@ai.mit.edu (Jeremy H. Brown)
Cc: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>,
	Forum of ding/Gnus users <ding@gnus.org>
Subject: Re: Using Eric Raymond's bogofilter tool within Gnus
Date: 03 Sep 2002 10:13:59 -0400
Message-ID: <uv6bs7fi2jc.fsf@suspiria.ai.mit.edu>
In-Reply-To: <oqfzwr6vaf.fsf@titan.progiciels-bpi.ca>

pinard@iro.umontreal.ca (François Pinard) writes:
> Let me thank you for the two references above.  Here are other
> references I have on Bayes filtering.

Let me second those thanks, and thank you as well!  With as many
projects as you've found, it'd be fun to have a running head-to-head
classifier competition.

> . @ http://www.ai.mit.edu/~jrennie/ifile/
> . @ http://www.ai.mit.edu/~jhbrown/ifile-gnus.html

I've been playing with ifile a great deal lately (thus leading to
ifile-gnus). A product review:

The good:

ifile is spectacularly accurate at classifying spam vs. non-spam.  I
love it.  

You can also use ifile as a more general classifier so that it will learn
to file your mail into arbitrary groups; it is reasonably accurate at
that (about 85% according to a paper the ifile author wrote, which
jibes with my experience).

The bad:

85% accurate classification isn't good enough to use across the board.
(Although it's a reasonable way of splitting mail that your
split-rules missed and would otherwise get dumped in your "misc"
group.)

Performance isn't stellar.  This is mostly because the database is
stored in a large flat text file.  With two classes (spam, non-spam)
it's usable with a DB around 200KB; with more classes, the database
rapidly heads towards a meg or so, and ifile becomes almost unusably
slow due to startup overhead.  (Caveat: I'm running it on a slow
computer, with the database stored on NFS, so I'm pretty much the
worst case imaginable.  I have specific performance numbers; if anyone
wants them, contact me personally.)


The ugly:

Given a bunch of messages to classify or learn from all at once, ifile
parses them all into memory before moving on to the next step; if your
mailboxes are as out of control as mine, this will cause ifile to run
out of memory and lose.  I don't think there's any fundamental
reason that this couldn't be fixed.  I have dreams of doing this, and
of making ifile run as a daemon to avoid the db-startup overhead.
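
To illustrate why the parse-everything-first step looks avoidable, here
is a minimal Python sketch of learning from messages one at a time.  It
is not ifile's code: the tokenizer, the function names, and the
in-memory count tables are all assumptions made for illustration.  The
point is only that word counts can be updated per message, so memory
grows with the vocabulary rather than with the size of the mailbox.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer; ifile's real lexer is not described in this message.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a message body into lowercase word tokens."""
    return TOKEN.findall(text.lower())


def learn_streaming(labeled_messages):
    """Update per-class word counts one message at a time.

    `labeled_messages` is any iterable of (label, text) pairs.  Only the
    message currently being processed is held in memory, plus the counts.
    """
    counts = defaultdict(Counter)
    for label, text in labeled_messages:
        counts[label].update(tokenize(text))
    return counts


def read_messages(pairs):
    """Hypothetical stand-in for a mailbox reader that yields lazily."""
    for label, text in pairs:
        yield label, text


if __name__ == "__main__":
    sample = [
        ("spam", "Buy cheap pills now"),
        ("ham", "Meeting notes attached"),
        ("spam", "cheap cheap offer"),
    ]
    model = learn_streaming(read_messages(sample))
    print(model["spam"]["cheap"])
```

Because the reader is a generator, a real mailbox reader could be
swapped in without changing the learning loop.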


I'd love to see more reviews of Bayesian (or other) spam filters; it'd
be trivial to mutate ifile-gnus to use just about any filter that can
be driven from the command line.
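
As a rough picture of what "driving a filter from the command line"
amounts to, here is a hedged Python sketch (ifile-gnus itself is Emacs
Lisp, and this is not its code).  It pipes one message to an external
program on stdin and returns both the exit status and the output, since
filters differ in which of the two carries the verdict.  The name
`run_filter` and the stand-in `FAKE_FILTER` command are hypothetical;
no real filter's flags are assumed.

```python
import subprocess
import sys


def run_filter(command, message, timeout=30):
    """Feed one message to a filter on stdin; return (exit_code, stdout).

    `command` is an argument list for whatever filter is being wrapped.
    """
    result = subprocess.run(
        command,
        input=message,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip()


# Hypothetical stand-in filter so the sketch runs without any real
# spam filter installed: it prints "spam" if the body mentions viagra.
FAKE_FILTER = [
    sys.executable,
    "-c",
    "import sys; body = sys.stdin.read().lower(); "
    "print('spam' if 'viagra' in body else 'ham')",
]

if __name__ == "__main__":
    print(run_filter(FAKE_FILTER, "Cheap VIAGRA here"))
    print(run_filter(FAKE_FILTER, "Lunch on Friday?"))
```

Swapping filters then means changing only the argument list and the
rule that maps the exit status or output to a group name.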

Jeremy



