[9fans] bayesian spam filtering

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] bayesian spam filtering
@ 2003-09-01 14:53 Russ Cox
From: Russ Cox @ 2003-09-01 14:53 UTC
  To: 9fans

I've spent a few days here and there working on it,
and it's easy to get something half-decent but hard to
get something that's really quite good.

If you're interested in what I've done already, I just
packaged up all the pieces (I think) and put them in
/n/sources/extra/spam.tar.gz.  See /n/sources/extra/spam.notes
for the list of files I included as well as a note about
how to use it that I wrote but never circulated.

I think my stuff is close -- the biggest wart is that
it uses Berkeley DB to store the frequency tables.
I think that might be necessary, but I'm not really sure.
Tokenization is hard.  Msgcat does a reasonable job.
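
To make "frequency tables" concrete, here is a minimal Python sketch of
tokenizing messages and counting, for each token, how many messages it
appears in.  It is an illustration only: the regular expression, the
function names, and the in-memory Counter (standing in for an on-disk
table) are all assumptions, and none of it reflects what Msgcat or the
code in spam.tar.gz actually does.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, $, ' and -.
# Real mail needs far more care (headers, MIME parts, HTML, encodings).
TOKEN = re.compile(r"[a-z0-9$'-]+")

def tokenize(text):
    """Return the set of distinct tokens in a message body."""
    return set(TOKEN.findall(text.lower()))

def add_message(table, text):
    """Count each distinct token once per message."""
    table.update(tokenize(text))

spam_table = Counter()  # in-memory stand-in for an on-disk frequency table
add_message(spam_table, "Buy cheap pills now, cheap!")
add_message(spam_table, "Cheap loans approved")
print(spam_table["cheap"])  # 2: the token appears in two messages
```

Counting a token once per message rather than once per occurrence is
one common choice; counting every occurrence is equally defensible.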

I've actually been using Mozilla Thunderbird to process
my mail mostly (though I'm typing this in nedmail).
The one feature it has that I really like is the
spam handling.  It could be better in some ways,
but the spam interface is much better than what I built.
When mail comes in, it's automatically flagged as
spam or not.  You can then correct its mistakes (if any)
by just clicking on lines in the message listing.
Then once you're happy you click "Delete Mail Marked as Junk"
and away they go.  If acme and nedmail had a
way to select disjoint sets of messages, I think
that would help a lot.  I've found that although
Thunderbird's filtering is not as accurate as what
I built, the interface makes up for it.

Notice I said process, not read.  I still read with
nedmail and Mail, at least when I'm in Plan 9, but
to do things like kill off the spam and manage lots
of different mail boxes, the external mail readers
seem like a big win.  Sad but true.

Russ


* [9fans] bayesian spam filtering
@ 2003-09-19  6:07 Russ Cox
From: Russ Cox @ 2003-09-19  6:07 UTC
  To: 9fans

Nemo cleaned up my spam filtering scripts to make them
work again, and then I cleaned up some other things,
and I put them in the distribution.  They're still fairly
experimental -- at the least I plan to rename the internal
commands to make a little more sense.

Here's an overview that will serve as documentation until
I get a chance to do a better job.  I'm pushing this out
because the minutely 150kB attachments I've been getting
were enough to make me want to have a filter in place
before going to sleep for a few hundred messages.

The basic idea is that your mail and your spam are so
different that simple word frequency analysis is
more than enough to tell them apart.  Further, the
character of spam changes only slowly over time, so as long
as the filter can keep up, it never falls out of date.
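
As a rough illustration of how word frequencies alone can separate the
two classes, the Python sketch below scores a message by summing
per-word log-likelihood ratios, in the style of a textbook naive Bayes
classifier.  The function names, the add-one smoothing, and the toy
training messages are assumptions made for the example; the filter in
the distribution may compute its score differently.

```python
import math
from collections import Counter

def train(messages):
    """Build a table of how many messages each word appears in."""
    table = Counter()
    for text in messages:
        table.update(set(text.lower().split()))
    return table

def spam_score(text, spam, ham, nspam, nham):
    """Sum of per-word log-likelihood ratios; positive means spam-like."""
    score = math.log(nspam / nham)  # prior from the message counts
    for word in set(text.lower().split()):
        # Add-one smoothing so unseen words do not zero out the estimate.
        p_spam = (spam[word] + 1) / (nspam + 2)
        p_ham = (ham[word] + 1) / (nham + 2)
        score += math.log(p_spam / p_ham)
    return score

# Toy training data, invented for this example.
spam_msgs = ["cheap pills online", "cheap loans now"]
ham_msgs = ["plan 9 kernel patch", "acme mail patch"]
spam, ham = train(spam_msgs), train(ham_msgs)

print(spam_score("cheap pills", spam, ham, 2, 2) > 0)   # True
print(spam_score("kernel patch", spam, ham, 2, 2) > 0)  # False
```

With real mailboxes the tables hold many thousands of words, and the
threshold between spam and normal mail is usually tuned rather than
fixed at zero.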

To use it, run /mail/lib/setup.bayes.  That will save
a copy of your current pipeto, if you have one, and
write a pipeto that invokes the bayes filter.

Each user has their own profile of what is spam and
what is normal mail.  To start, setup.bayes copies
profiles based on the spam we've collected over the
last year and based on my own mailbox from a few
months ago.  These should be good enough to get you
started, and then the profiles will adapt to you as
you use the system.

When mail comes in, it is classified.  The result is
logged in /mail/box/you/_bounced.  If the message is
deemed spam, it is delivered normally but marked with
SPAM: at the beginning of the subject line.  This is
a good way to run the filter at first, when you're
still training it.  Once you're more confident in
the filter's choices, you can edit $mail/pipeto
to change what happens to spam.  Change the line
"spool-tagged-spam" to, say, "spool /mail/box/$USER/spam"
to spool spam into a separate "spam" mbox.

Occasionally you will want to "mail -f spam" to view
the spam mail box and see if anything important has
fallen into it.  I've found that commercial email (like
receipts from online stores) and spam are very hard for
this filter to tell apart, so these almost always must be fished
out of the spam box.  Except for that, I get hardly
any misclassified messages.  It's fine to delete the
spam that has accumulated.

When the filter classifies a message as spam or not,
it updates its internal profiles.  Assuming it's making
correct decisions, this helps it track changing message
profiles.  Of course, if it makes a wrong decision, updating
the profile will compound that wrong decision, making it
more likely next time.  So you need to explicitly tell
it about mistakes so that the mistakes don't compound.
At a nedmail prompt after viewing the misclassified message,
you can do one of

	||upas/spam    - note that message should be spam
	||upas/unspam  - note that message is NOT spam
	||upas/isspam  - show words used to make decision
	                 about whether message is spam

For example, if you got a message about the Holmdel
Coffee Hour, you might run ||upas/spam followed by
||upas/isspam to see if the profile had changed enough
to mark that message as spam.  If not, and you were feeling
persistent, you might run ||upas/spam again and again
until the profile had changed enough.
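
The feedback loop described above (train on the filter's own verdict,
then let the user correct mistakes) can be sketched in Python as
follows.  The Profile class, its method names, and the crude per-word
vote are all hypothetical.  In particular, this sketch undoes the
mistaken update before recording the right label, which is only one
plausible design; nothing here says what upas/spam and upas/unspam
actually do to the profiles.

```python
from collections import Counter

class Profile:
    """Toy stand-in for a per-user spam/ham profile (hypothetical)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, words, label):
        self.counts[label].update(set(words))

    def unlearn(self, words, label):
        self.counts[label].subtract(set(words))

    def is_spam(self, words):
        # Crude vote: each word sides with the class that has seen it more.
        spam, ham = self.counts["spam"], self.counts["ham"]
        votes = sum((spam[w] > ham[w]) - (spam[w] < ham[w])
                    for w in set(words))
        return votes > 0

    def deliver(self, words):
        """Classify, then train on the filter's own verdict."""
        label = "spam" if self.is_spam(words) else "ham"
        self.learn(words, label)
        return label

    def correct(self, words, wrong, right):
        """Undo a mistaken update and record the right label instead."""
        self.unlearn(words, wrong)
        self.learn(words, right)

p = Profile()
p.learn(["cheap", "pills"], "spam")
msg = ["holmdel", "coffee", "hour"]
first = p.deliver(msg)          # unseen words, so delivered as normal mail
p.correct(msg, "ham", "spam")   # the user says it was spam
second = p.deliver(msg)
print(first, second)            # ham spam
```

Without the call to correct, the first delivery would have added the
message's words to the normal-mail counts, making the same mistake
more likely on the next similar message.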

Acme Mail users can type and then execute Spam, Unspam,
or Isspam in the message tag to accomplish the same things.

The spam filter is computationally a little expensive.
It adds a second or two to the delivery of messages.

Russ
