From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <839cda4a8f51f642839cac76de286f23@plan9.bell-labs.com>
From: "Russ Cox" <rsc@mit.edu>
To: 9fans@cse.psu.edu
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] bayesian spam filtering
Date: Fri, 19 Sep 2003 02:07:37 -0400
Topicbox-Message-UUID: 3d1c49d8-eacc-11e9-9e20-41e7f4b1d025

Nemo cleaned up my spam filtering scripts to make them
work again, and then I cleaned up some other things,
and I put them in the distribution.  They're still fairly
experimental -- at the least I plan to rename the internal
commands to make a little more sense.

Here's an overview that will serve as documentation until
I get a chance to do a better job.  I'm pushing this out
now because the 150kB attachments I've been getting every
minute were enough to make me want a filter in place
before going to sleep for a few hundred messages.

The basic idea is that your mail and your spam are so
different that simple word frequency analysis is
more than enough to tell them apart.  Further, the
character of spam changes only slowly over time, so as long
as the filter can keep up, it never falls out of date.
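
To make "simple word frequency analysis" concrete, here is a
minimal sketch in Python.  It is not the code in the
distribution: the function names, the tokenizer, the add-one
smoothing, and the tiny training sets are all assumptions
chosen for illustration.  It counts words in known spam and
known normal mail, then scores a new message by summing the
per-word log-likelihood ratios.

```python
import math
import re
from collections import Counter

def words(text):
    """Split a message into lowercase word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    """Count how often each word occurs across a set of messages."""
    counts = Counter()
    for m in messages:
        counts.update(words(m))
    return counts

def spam_score(text, spam, ham):
    """Sum of per-word log-likelihood ratios; a positive score leans spam."""
    nspam = sum(spam.values())
    nham = sum(ham.values())
    vocab = len(set(spam) | set(ham))
    score = 0.0
    for w in words(text):
        # Add-one smoothing so unseen words do not produce log(0).
        pspam = (spam[w] + 1) / (nspam + vocab)
        pham = (ham[w] + 1) / (nham + vocab)
        score += math.log(pspam / pham)
    return score

# Made-up training data, standing in for the real profiles.
spam = train(["buy cheap pills now", "cheap mortgage offer, act now"])
ham = train(["the 9p patch is in the distribution",
             "acme mail bug in the kernel patch"])

print(spam_score("cheap pills offer", spam, ham) > 0)
print(spam_score("kernel patch for acme", spam, ham) > 0)
```

The two word-count tables play the role of the spam and
normal-mail profiles; a real filter would differ in how it
picks and weights words.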

To use it, run /mail/lib/setup.bayes.  That will save
a copy of your current pipeto, if you have one, and
write a pipeto that invokes the bayes filter.

Each user has their own profile of what is spam and
what is normal mail.  To start, setup.bayes copies in two
profiles: one built from the spam we've collected over the
last year, and one built from my own mailbox from a few
months ago.  These should be good enough to get you
started, and then the profiles will adapt to you as
you use the system.

When mail comes in, it is classified.  The result is
logged in /mail/box/you/_bounced.  If the message is
deemed spam, it is delivered normally but marked with
SPAM: at the beginning of the subject line.  This is
a good way to run the filter at first, when you're
still training it.  Once you're more confident in
the filter's choices, you can edit $mail/pipeto
to change what happens to spam.  Change the line
"spool-tagged-spam" to, say, "spool /mail/box/$USER/spam"
to spool spam into a separate "spam" mbox.

Occasionally you will want to "mail -f spam" to view
the spam mail box and see if anything important has
fallen into it.  I've found that commercial email (like
receipts from online stores) and spam are very hard for
this filter to tell apart, so commercial mail almost
always has to be fished out of the spam box.  Except
for that, I get hardly
any misclassified messages.  It's fine to delete the
spam that has accumulated.

When the filter classifies a message as spam or not,
it updates its internal profiles.  Assuming it's making
correct decisions, this helps it track changing message
profiles.  Of course, if it makes a wrong decision, updating
the profile compounds the error, making the same mistake
more likely next time.  So you need to tell it explicitly
about mistakes.
At a nedmail prompt after viewing the misclassified message,
you can do one of

	||upas/spam    - note that message should be spam
	||upas/unspam  - note that message is NOT spam
	||upas/isspam  - show words used to make decision
	                 about whether message is spam

For example, if you got a message about the Holmdel
Coffee Hour, you might run ||upas/spam followed by
||upas/isspam to see if the profile had changed enough
to mark that message as spam.  If not, and you were feeling
persistent, you might run ||upas/spam again and again
until the profile had changed enough.
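
The feedback loop can be sketched in Python as well.  This is
an illustration under stated assumptions, not how upas/spam
and upas/unspam are implemented: the Profiles class, its
method names, and the choice to move a corrected message's
words from one profile to the other are all hypothetical.
It shows a message being filed as normal mail, that choice
being folded back into the profile, and repeated corrections
eventually tipping the decision.

```python
import math
from collections import Counter

class Profiles:
    """Hypothetical pair of word-count profiles: spam and normal mail ("ham")."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, label, text):
        self.counts[label].update(text.lower().split())

    def unlearn(self, label, text):
        # In-place Counter subtraction drops entries that fall to zero or below.
        self.counts[label] -= Counter(text.lower().split())

    def score(self, text):
        spam, ham = self.counts["spam"], self.counts["ham"]
        nspam, nham = sum(spam.values()), sum(ham.values())
        vocab = max(1, len(set(spam) | set(ham)))
        total = 0.0
        for w in text.lower().split():
            # Add-one smoothed log-likelihood ratio for each word.
            total += math.log((spam[w] + 1) / (nspam + vocab))
            total -= math.log((ham[w] + 1) / (nham + vocab))
        return total

    def is_spam(self, text):
        return self.score(text) > 0

    def deliver(self, text):
        """Classify, then fold the message into whichever profile was chosen."""
        label = "spam" if self.is_spam(text) else "ham"
        self.learn(label, text)
        return label

    def correct(self, text, label):
        """Assumed correction: move one copy of the message into `label`."""
        other = "ham" if label == "spam" else "spam"
        self.unlearn(other, text)
        self.learn(label, text)

p = Profiles()
p.learn("spam", "cheap pills now")
p.learn("ham", "coffee hour in holmdel today")
p.learn("ham", "coffee hour moved to the lounge")

msg = "holmdel coffee hour"
first = p.deliver(msg)      # filed as normal mail, which reinforces that choice
tries = 0
while not p.is_spam(msg):   # keep correcting until the decision flips
    p.correct(msg, "spam")
    tries += 1
print(first, tries)
```

Because deliver() learns from its own verdict, an uncorrected
mistake pushes the profile further in the wrong direction,
which is why the explicit corrections matter.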

Acme Mail users can type and then execute Spam, Unspam,
or Isspam in the message tag to accomplish the same things.

The spam filter is computationally a little expensive.
It adds a second or two to the delivery of messages.

Russ