From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
From: "Russ Cox"
To: 9fans@cse.psu.edu
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] bayesian spam filtering
Date: Mon, 1 Sep 2003 10:53:53 -0400
Topicbox-Message-UUID: 28223e66-eacc-11e9-9e20-41e7f4b1d025

I've spent a few days here and there working on it, and it's easy
to get something half-decent but hard to get something that's
really quite good.

If you're interested in what I've done already, I just packaged up
all the pieces (I think) and put them in /n/sources/extra/spam.tar.gz.
See /n/sources/extra/spam.notes for the list of files I included,
as well as a note about how to use it that I wrote but never
circulated.

I think my stuff is close -- the biggest wart is that it uses
Berkeley DB to store the frequency tables. I think that might be
necessary, but I'm not really sure. Tokenization is hard. Msgcat
does a reasonable job.

I've actually been mostly using Mozilla Thunderbird to process my
mail (though I'm typing this in nedmail). The one feature it has
that I really like is the spam handling. It could be better in some
ways, but the spam interface is much better than what I built. When
mail comes in, it's automatically flagged as spam or not. You can
then correct its mistakes (if any) by just clicking on lines in the
message listing. Then, once you're happy, you click "Delete Mail
Marked as Junk" and away they go. If acme and nedmail had a way to
select disjoint sets of messages, I think that would help a lot.
I've found that although Thunderbird's filtering is not as accurate
as what I built, the interface makes up for it.

Notice I said process, not read. I still read with nedmail and
Mail, at least when I'm in Plan 9, but for things like killing off
the spam and managing lots of different mailboxes, the external
mail readers seem like a big win. Sad but true.

Russ