From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <839cda4a8f51f642839cac76de286f23@plan9.bell-labs.com>
From: "Russ Cox"
To: 9fans@cse.psu.edu
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] bayesian spam filtering
Date: Fri, 19 Sep 2003 02:07:37 -0400
Topicbox-Message-UUID: 3d1c49d8-eacc-11e9-9e20-41e7f4b1d025

Nemo cleaned up my spam filtering scripts to make them work again,
and then I cleaned up some other things, and I put them in the
distribution. They're still fairly experimental -- at the least
I plan to rename the internal commands to make a little more sense.
Here's an overview that will serve as documentation until I get
a chance to do a better job. I'm pushing this out because the
minutely 150kB attachments I've been getting were enough to make
me want to have a filter in place before going to sleep for a
few hundred messages.

The basic idea is that your mail and your spam are so different
that simple word frequency analysis is more than enough to tell
them apart. Further, the character of spam changes only slowly
over time, so as long as the filter can keep up, it never falls
out of date.

To use it, run /mail/lib/setup.bayes. That will save a copy of
your current pipeto, if you have one, and write a pipeto that
invokes the bayes filter.

Each user has their own profile of what is spam and what is
normal mail. To start, setup.bayes copies profiles based on the
spam we've collected over the last year and based on my own
mailbox from a few months ago. These should be good enough to
get you started, and then the profiles will adapt to you as you
use the system.

When mail comes in, it is classified. The result is logged in
/mail/box/you/_bounced. If the message is deemed spam, it is
delivered normally but marked with SPAM: at the beginning of the
subject line. This is a good way to run the filter at first,
when you're still training it.
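
As a rough illustration of the word-frequency idea and the SPAM:
subject tag described above, here is a minimal sketch in Python.
It is not the code shipped in the distribution: the Profile class,
the tokenizer, the add-one smoothing, and the "score above zero
means spam" rule are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def words(text):
    """Split a message into lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class Profile:
    """Word counts for one class of mail (hypothetical structure)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def add(self, text):
        """Fold the words of one message into the profile."""
        for w in words(text):
            self.counts[w] += 1
            self.total += 1

    def logprob(self, word, vocab_size):
        # Add-one smoothing, so a word never seen in this profile
        # gets a small probability instead of zero.
        return math.log((self.counts[word] + 1) / (self.total + vocab_size))


def spam_score(text, spam, ham):
    """Sum of per-word log-likelihood ratios; above zero leans spam."""
    vocab = len(set(spam.counts) | set(ham.counts)) or 1
    return sum(
        spam.logprob(w, vocab) - ham.logprob(w, vocab) for w in words(text)
    )


def tag_subject(subject, text, spam, ham):
    """Prefix the subject with SPAM: when the score leans spam."""
    if spam_score(text, spam, ham) > 0:
        return "SPAM: " + subject
    return subject
```

Each user would hold one Profile for spam and one for normal mail,
seeded from existing mail and then grown over time. Summing log
ratios rather than multiplying probabilities keeps long messages
from underflowing.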

Once you're more confident in the filter's choices, you can edit
$mail/pipeto to change what happens to spam. Change the line
"spool-tagged-spam" to, say, "spool /mail/box/$USER/spam" to
spool spam into a separate "spam" mbox.

Occasionally you will want to "mail -f spam" to view the spam
mail box and see if anything important has fallen into it. I've
found that commercial email (like receipts from online stores)
and spam are very hard for this filter to tell apart, so these
almost always must be fished out of the spam box. Except for
that, I get hardly any misclassified messages. It's fine to
delete the spam that has accumulated.

When the filter classifies a message as spam or not, it updates
its internal profiles. Assuming it's making correct decisions,
this helps it track changing message profiles. Of course, if it
makes a wrong decision, updating the profile will compound that
wrong decision, making it more likely next time. So you need to
explicitly tell it about mistakes so that the mistakes don't
compound. At a nedmail prompt after viewing the misclassified
message, you can do one of

	||upas/spam	- note that message should be spam
	||upas/unspam	- note that message is NOT spam
	||upas/isspam	- show words used to make decision
			  about whether message is spam

For example, if you got a message about the Holmdel Coffee Hour,
you might run ||upas/spam followed by ||upas/isspam to see if
the profile had changed enough to mark that message as spam. If
not, and you were feeling persistent, you might run ||upas/spam
again and again until the profile had changed enough.

Acme Mail users can type and then execute Spam, Unspam, or
Isspam in the message tag to accomplish the same things.

The spam filter is computationally a little expensive. It adds
a second or two to the delivery of messages.

Russ