From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2b7ff40e59acfd34ede485505064c34a@plan9.escet.urjc.es>
From: Fco.J.Ballesteros
To: 9fans@cse.psu.edu
Subject: Re: [9fans] spam filtering fs
In-Reply-To:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="upas-cenawdsonvhmkmkxrynqzyjouy"
Date: Mon, 1 Sep 2003 16:16:51 +0200
Topicbox-Message-UUID: 27c8e5f0-eacc-11e9-9e20-41e7f4b1d025

This is a multi-part message in MIME format.
--upas-cenawdsonvhmkmkxrynqzyjouy
Content-Disposition: inline
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

I'd prefer a plain filter that could be called from pipeto, and
regarding training, I'd prefer to be able to use a different
program, say, upas/spam -y

; Mon, 1 Sep 2003 08:48:40 -0400 (EDT)
Message-ID:
From: steve.simon@snellwilcox.com
To: 9fans@cse.psu.edu
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: [9fans] spam filtering fs
Sender: 9fans-admin@cse.psu.edu
Errors-To: 9fans-admin@cse.psu.edu
X-BeenThere: 9fans@cse.psu.edu
X-Mailman-Version: 2.0.11
Precedence: bulk
Reply-To: 9fans@cse.psu.edu
List-Id: Fans of the OS Plan 9 from Bell Labs <9fans.cse.psu.edu>
List-Archive:
Date: Mon, 1 Sep 2003 13:48:39 +0100
X-Spam-Status: No, hits=1.4 required=5.0 tests=NO_REAL_NAME,RCVD_IN_OSIRUSOFT_COM version=2.55
X-Spam-Level: *
X-Spam-Checker-Version: SpamAssassin 2.55 (1.174.2.19-2003-05-19-exp)

Hi,

I'm starting to think about spam filtering again. I plan to use
Paul Graham's ideas plus the changes suggested by Gary Robinson;
basically a naive Bayesian classifier.

I think it's best to implement it as a filesystem which overlays
upas/fs and is transparent to valid email but opaque to spam. This
means the token frequency database remains in RAM and does not have
to be reloaded to test each new email.

In order for the filter to learn, the user must be able to classify
the few spams that slip through, so I propose to wstat() emails to
zero length before deletion if they are spam, and simply delete them
if they are valid.

In the longer term I have some ideas about fingerprinting attached
images, and adding tokens for "is html email", "has MS screen-saver
attached", etc.

Anyone any opinions on this approach?

-Steve
--upas-cenawdsonvhmkmkxrynqzyjouy--
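
The classifier described in the message above could be sketched roughly
as follows. This is only one reading of the Graham and Robinson ideas,
not code from the thread: the TokenDB class, the tokenize() helper, the
regular expression, and the parameter values s=1.0 and x=0.5 are all
assumptions made for illustration. Per-token spam probabilities are
smoothed toward a neutral prior and then combined with geometric means,
and the tables live in memory so they need not be reloaded per message.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case tokens (hypothetical, very simple rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TokenDB:
    """In-memory token frequency tables (illustrative layout only)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of valid messages containing it
        self.nspam = 0         # spam messages seen in training
        self.nham = 0          # valid messages seen in training

    def train(self, text, is_spam):
        """Count each distinct token once per message."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def token_prob(self, tok, s=1.0, x=0.5):
        """Smoothed spam probability f(w) = (s*x + n*p(w)) / (s + n)."""
        b, g = self.spam[tok], self.ham[tok]
        n = b + g
        if n == 0:
            return x  # unseen token: fall back to the assumed prior
        bfreq = b / max(self.nspam, 1)
        gfreq = g / max(self.nham, 1)
        p = bfreq / (bfreq + gfreq)
        return (s * x + n * p) / (s + n)

    def score(self, text):
        """Combine token probabilities; result is in (0, 1), 0.5 is neutral."""
        probs = [self.token_prob(t) for t in set(tokenize(text))]
        if not probs:
            return 0.5
        n = len(probs)
        # Geometric means computed in log space to avoid underflow.
        P = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
        Q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
        S = (P - Q) / (P + Q)
        return (1.0 + S) / 2.0
```

In a filesystem built this way, a score above some chosen threshold
would hide the message, and the wstat()-to-zero-length convention
proposed above would map to a call like db.train(text, True).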