9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] spam filtering fs
@ 2003-09-01 12:48 steve.simon
  2003-09-01 14:16 ` Fco.J.Ballesteros
  0 siblings, 1 reply; 7+ messages in thread
From: steve.simon @ 2003-09-01 12:48 UTC (permalink / raw)
  To: 9fans

Hi,

I'm starting to think about spam filtering again.

I plan to use Paul Graham's ideas plus the changes
suggested by Gary Robinson; basically a naive Bayesian
classifier.
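
A minimal sketch of the kind of scoring this refers to, not anyone's
actual implementation. It assumes Graham's combining rule over the most
"interesting" tokens and Robinson's smoothed per-token estimate
f(w) = (s*x + n*p(w)) / (s + n). The tokenizer, the class name Table, and
the defaults s=1, x=0.5, keep=15 are illustrative assumptions; Graham's
count weighting and Robinson's alternative combining schemes are left out.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9$'!-]+")  # assumed tokenizer, for illustration


def tokens(text):
    """Return the set of lower-cased tokens in a message."""
    return set(t.lower() for t in TOKEN.findall(text))


class Table:
    """In-memory token frequency tables (hypothetical helper)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.valid = Counter()  # token -> number of valid messages containing it
        self.nspam = 0
        self.nvalid = 0

    def train(self, text, is_spam):
        if is_spam:
            self.spam.update(tokens(text))
            self.nspam += 1
        else:
            self.valid.update(tokens(text))
            self.nvalid += 1

    def token_prob(self, tok, s=1.0, x=0.5):
        """Robinson-style smoothed estimate that a message with tok is spam."""
        b = self.spam[tok] / self.nspam if self.nspam else 0.0
        g = self.valid[tok] / self.nvalid if self.nvalid else 0.0
        n = self.spam[tok] + self.valid[tok]
        p = b / (b + g) if b + g > 0 else x
        # With s > 0 the result is never exactly 0 or 1, so logs are safe.
        return (s * x + n * p) / (s + n)

    def score(self, text, keep=15):
        """Graham-style combination over the tokens furthest from 0.5."""
        probs = sorted((self.token_prob(t) for t in tokens(text)),
                       key=lambda p: abs(p - 0.5), reverse=True)[:keep]
        log_spam = sum(math.log(p) for p in probs)
        log_valid = sum(math.log(1.0 - p) for p in probs)
        # Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
        return 1.0 / (1.0 + math.exp(log_valid - log_spam))
```

A message would then be treated as spam when score() exceeds some
threshold; the threshold itself is a tuning choice not fixed here.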

I think it's best to implement it as a filesystem which overlays
upas/fs and which is transparent to valid email but opaque to spam.
This means the token frequency database remains in RAM and does not
have to be reloaded to test each new email.
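
The overlay idea can be sketched roughly as follows. This is only a toy
model in Python, not a 9P file server: a dict stands in for the mailbox
served by upas/fs, the class name Overlay and the load_table callback are
hypothetical, and a simple word-set test stands in for the real classifier.
The point it illustrates is that the table is loaded once and reused for
every message.

```python
class Overlay:
    """Long-lived view of a mailbox that hides messages judged to be spam."""

    def __init__(self, mailbox, load_table):
        self.mailbox = mailbox      # name -> message text (stand-in for upas/fs)
        self.table = load_table()   # loaded once, then kept in memory
        self.verdicts = {}          # name -> True if spam (cached per message)

    def is_spam(self, text):
        # Stand-in test: a real filter would compute a Bayesian score here.
        return any(word in self.table for word in text.lower().split())

    def visible(self, name):
        if name not in self.verdicts:
            self.verdicts[name] = self.is_spam(self.mailbox[name])
        return not self.verdicts[name]

    def listdir(self):
        """Valid mail passes through; spam never appears in the listing."""
        return [n for n in sorted(self.mailbox) if self.visible(n)]

    def read(self, name):
        if name not in self.mailbox or not self.visible(name):
            raise FileNotFoundError(name)
        return self.mailbox[name]
```

New messages added to the underlying mailbox are classified on first
access without reloading the table.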

In order for the filter to learn the user must be able to classify
the few spams that slip through, so I propose to wstat() emails
to zero length before deletion if they are spam and simply delete
them if they are valid.
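
The proposed training convention could be modelled like this. It is a
sketch under assumptions: the class Trainer and its method names are
hypothetical, and the overlay is assumed to keep the original text when a
message is truncated, so that it can still be learned from at removal time.

```python
class Trainer:
    """Track truncate-then-remove (spam) versus plain remove (valid)."""

    def __init__(self, messages):
        self.messages = dict(messages)  # name -> original message text
        self.truncated = set()          # names wstat'ed to zero length
        self.spam = []                  # texts to feed to the spam table
        self.valid = []                 # texts to feed to the valid table

    def wstat_length(self, name, length):
        # Only record the intent; the original text is kept for training.
        if name in self.messages and length == 0:
            self.truncated.add(name)

    def remove(self, name):
        text = self.messages.pop(name)
        if name in self.truncated:
            self.truncated.discard(name)
            self.spam.append(text)
        else:
            self.valid.append(text)
```

Every deletion then trains the filter one way or the other, with no
separate training command.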

In the longer term I have some ideas about fingerprinting attached
images, adding tokens for "is html email", "has MS screen-saver attached"
etc.

Anyone any opinions on this approach?

-Steve



* Re: [9fans] spam filtering fs
  2003-09-01 12:48 [9fans] spam filtering fs steve.simon
@ 2003-09-01 14:16 ` Fco.J.Ballesteros
  2003-09-01 14:19   ` David Presotto
  0 siblings, 1 reply; 7+ messages in thread
From: Fco.J.Ballesteros @ 2003-09-01 14:16 UTC (permalink / raw)
  To: 9fans

I'd prefer a plain filter that could be called from pipeto, and
regarding training, I'd prefer to be able to use a different program,
say, upas/spam -y <spammail, so I could run it from within my mail
reader. Since you asked for opinions... :-)
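
A rough sketch of what such a stand-alone filter might look like. The
"upas/spam" name above is only a suggestion, so everything here is
hypothetical: the -n flag for valid mail, the JSON table file, and the
crude count comparison that stands in for a Bayesian score. It assumes the
message text arrives as a string (from standard input when run by pipeto).

```python
import json
import os


def load(path):
    """Read the token tables from disk, or start with empty ones."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"spam": {}, "valid": {}}


def save(path, table):
    with open(path, "w") as f:
        json.dump(table, f)


def main(argv, text, path):
    """-y: train as spam, -n: train as valid, no flag: classify.

    Returns an exit status: 1 if the message looks like spam, else 0.
    """
    table = load(path)  # reloaded on every run, unlike the fs approach
    words = set(text.lower().split())
    if argv and argv[0] in ("-y", "-n"):
        kind = "spam" if argv[0] == "-y" else "valid"
        for w in words:
            table[kind][w] = table[kind].get(w, 0) + 1
        save(path, table)
        return 0
    spam = sum(table["spam"].get(w, 0) for w in words)
    valid = sum(table["valid"].get(w, 0) for w in words)
    return 1 if spam > valid else 0
```

A wrapper script could call main(sys.argv[1:], sys.stdin.read(), path) and
use the returned status to decide where to deliver the message.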


* Re: [9fans] spam filtering fs
  2003-09-01 14:16 ` Fco.J.Ballesteros
@ 2003-09-01 14:19   ` David Presotto
  2003-09-01 14:30     ` Dan Cross
  0 siblings, 1 reply; 7+ messages in thread
From: David Presotto @ 2003-09-01 14:19 UTC (permalink / raw)
  To: 9fans

rsc already wrote and is using a bayesian filter via pipeto.



* Re: [9fans] spam filtering fs
  2003-09-01 14:19   ` David Presotto
@ 2003-09-01 14:30     ` Dan Cross
  0 siblings, 0 replies; 7+ messages in thread
From: Dan Cross @ 2003-09-01 14:30 UTC (permalink / raw)
  To: 9fans

> rsc already wrote and is using a bayesian filter via pipeto.

Is it in the distribution?

	- Dan C.




* Re: [9fans] spam filtering fs
  2003-09-01 14:41   ` Fco.J.Ballesteros
@ 2003-09-01 14:49     ` boyd, rounin
  0 siblings, 0 replies; 7+ messages in thread
From: boyd, rounin @ 2003-09-01 14:49 UTC (permalink / raw)
  To: 9fans

> You could assume that all mails that have not been classified as spam are
> not spam. I could then simply define a `S' script and run it from acme to
> let you know which ones are spam.

i used to do something like that on lunix:

    http://www.insultant.net/repo/dws.html

i _cannot_ use acme.  it's just not for me.




* Re: [9fans] spam filtering fs
  2003-09-01 14:36 ` steve.simon
@ 2003-09-01 14:41   ` Fco.J.Ballesteros
  2003-09-01 14:49     ` boyd, rounin
  0 siblings, 1 reply; 7+ messages in thread
From: Fco.J.Ballesteros @ 2003-09-01 14:41 UTC (permalink / raw)
  To: 9fans

You could assume that all mails that have not been classified as spam are
not spam. I could then simply define a `S' script and run it from acme to
let you know which ones are spam. But if rsc has one already done, we'd
better use his interface as is.


* Re: [9fans] spam filtering fs
       [not found] <758707399@snellwilcox.com>
@ 2003-09-01 14:36 ` steve.simon
  2003-09-01 14:41   ` Fco.J.Ballesteros
  0 siblings, 1 reply; 7+ messages in thread
From: steve.simon @ 2003-09-01 14:36 UTC (permalink / raw)
  To: 9fans

The problem with the "upas/spam -y <spammail" approach is that the
word frequency tables need to be kept up to date for the best results,
and relying on the user to type this in is unlikely to achieve that.

If you forget to add a new strain of spam, or valid emails, to the tables,
the filter stops learning and the threshold of spamminess used will
drift from the optimum - you will start to lose valid emails or get more spam.

An important feature of a user interface for Bayesian spam classifiers is
the equal ease with which the user can submit spam and valid emails to
the training code, otherwise the whole system becomes biased in favour
of least effort for the user :-)

-Steve


