caml-list - the Caml user's mailing list
From: Scott Alexander <salex@dsl.cis.upenn.edu>
To: caml-list@inria.fr
Cc: Richard Zidlicky <rz@linux-m68k.org>
Subject: Re: [Caml-list] posting policy and spam
Date: 04 Jan 2004 11:22:22 -0500
Message-ID: <1073233342.1808.191.camel@localhost.localdomain>
In-Reply-To: <20040104102340.GA1795@linux-m68k.org>

On Sun, 2004-01-04 at 05:23, Richard Zidlicky wrote:
> On Sun, Jan 04, 2004 at 12:28:37AM +0100, Sven Luther wrote:
> > On Sat, Jan 03, 2004 at 10:24:49AM +0100, Xavier Leroy wrote:
> > > There have been several complains recently about spam getting through
> > > the caml-list.
> > > 
> > > For your information, the list is filtered through SpamOracle, and the
> > > posting address receives several hundred spams a day.  Due to spammers
> > > getting more clever, the efficiency of the filtering went from perfect
> > > to about 99%.  That's enough to let significant amounts of spam slip
> > > through.
> > 
> > Well, on a similar subject, is there any chance of implementing a
> > workaround in spamoracle to counter those spams specifically designed to
> > fool the bayesian filters ? You know, those who have 4 lines of random
> > words in a text attachement, and then some html spam. 
> > 
> > I don't know if the bayesian filters or a modification thereof is able
> > to counter this kind of email, but i don't think so.
> 
> n-grams should be able to cope with the random words. There is already
> at least one library at sf implementing them so I am not sure it is
> worth to reimplement it in spamoracle.

FWIW, I've found the Bayesian stuff to do pretty well even with random
words given enough training.  (I'm using spambayes if it matters.)  Most
of the random words they pick aren't in my common words list as it turns
out.  And so many of the words in their actual message are in my spam
list.  (Obviously, this isn't a correct statement of how the algorithm
actually works, but I think it gives the right idea.)  After reading
Paul Graham's look back on Bayesian filtering after a year
(http://www.paulgraham.com/sofar.html), I looked more closely at how
some of my spam and ham were scoring.  Looking at the misspelling
approach, I currently score "viagra" as 0.974978, "vi@gra" as 0.844828,
and "v1@gra" as 0.908163.
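
To make the arithmetic behind such per-token scores concrete, here is a
minimal Python sketch of the naive-Bayes combination described in Paul
Graham's "A Plan for Spam".  It is an illustration under stated
assumptions, not how spambayes works internally (spambayes uses its own
combining scheme).  The three token scores are the ones quoted above;
the 0.4 default for unseen words, the 15-token cutoff, and the function
names `combine` and `score` are assumptions taken from Graham's essay or
invented for this example.

```python
from math import prod

# Per-token spam probabilities quoted in the message above.
TOKEN_SCORES = {
    "viagra": 0.974978,
    "vi@gra": 0.844828,
    "v1@gra": 0.908163,
}

# Assumption: Graham's default probability for a never-seen token.
UNKNOWN = 0.4


def combine(probs):
    """Naive-Bayes combination: P / (P + Q), P = prod(p), Q = prod(1 - p)."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def score(tokens, table=TOKEN_SCORES, keep=15):
    """Score a message using only the `keep` tokens furthest from 0.5."""
    probs = [table.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:keep])


if __name__ == "__main__":
    padding = [f"random{i}" for i in range(20)]  # hypothetical unseen words
    print(score(list(TOKEN_SCORES)))             # spammy tokens alone
    print(score(list(TOKEN_SCORES) + padding))   # same tokens plus padding
```

Under these assumptions, padding the message with twenty unseen words
lowers the combined score but leaves it above 0.9, because words near
0.5 carry little weight next to a few strongly spammy tokens.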

As for random words, looking through my list of messages to be trained,
I have a typical spam titled "Re: YGOCP, to the procurator".  Despite a
long list of random words and bogus tags breaking up the message
("<p>O</rigid>ur U</immature>S Li</prominent>censed Doc</shepherd>tors
wi</calve>ll<BR> Prescr</violate>ibes Y</esophagi>our Me</antonym>dication
F</eigenvector>or F</irreversible>ree"), it scores as 99.79%.  Not only
do they have some URL elements (like biz) which are high on my spam
list, but some of the random words have become spam identifiers (euclid,
metalwork, adequacy, bourgeoisie, cornish, rectilinear).  It did hit a
few on the ham list (oregon, weird, and laminar appear in spam for the
first time with this message), but not enough to be significant.
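
The quoted fragment is easier to read once the bogus tags are pulled
apart from the visible text.  The following Python sketch does only
that; it is not a description of how spambayes or SpamOracle tokenize
mail, and the names `SPAM`, `decoy_words`, and `visible_text` are
hypothetical.

```python
import re

# The fragment quoted in the message above, joined into one string.
SPAM = (
    "<p>O</rigid>ur U</immature>S Li</prominent>censed Doc</shepherd>tors "
    "wi</calve>ll<BR> Prescr</violate>ibes Y</esophagi>our Me</antonym>dication "
    "F</eigenvector>or F</irreversible>ree"
)


def decoy_words(html):
    """Return the dictionary words hidden inside bogus closing tags."""
    return re.findall(r"</([a-z]+)>", html)


def visible_text(html):
    """Return the text a mail client would render, with tags removed."""
    text = re.sub(r"<br\s*/?>", " ", html, flags=re.IGNORECASE)  # line break
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)                # other tags
    return " ".join(text.split())                                # tidy spaces


if __name__ == "__main__":
    print(decoy_words(SPAM))
    print(visible_text(SPAM))
```

Either view gives a statistical filter something to learn from: the
decoy words recur across spams, and the rejoined text is an ordinary
sales pitch.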

I do train on (almost) every message that I receive and have done so for
several months.  According to the statistics section I have "Total
emails trained: Spam: 3893 Ham: 12685".  And I am having a false
positive problem with Caml-list after the rash of spams.  It seems to be
getting close to being trained back, but Caml-list is a relatively low
volume list for me.

Anyway, enough nattering on.  I'm amazed by the Bayesian stuff and find
it interesting.

Best,
Scott
-- 
Scott Alexander <salex@dsl.cis.upenn.edu>

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

