From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id RAA22120; Sun, 4 Jan 2004 17:22:26 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id RAA22893 for <caml-list@pauillac.inria.fr>; Sun, 4 Jan 2004 17:22:25 +0100 (MET)
Received: from codex.cis.upenn.edu (CODEX.CIS.UPENN.EDU [158.130.6.15])
	by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id i04GMOH12589
	for <caml-list@inria.fr>; Sun, 4 Jan 2004 17:22:24 +0100 (MET)
Received: from localhost (localhost [127.0.0.1])
	by codex.cis.upenn.edu (8.12.10/8.12.9) with ESMTP id i04GMFvL022844;
	Sun, 4 Jan 2004 11:22:15 -0500 (EST)
Subject: Re: [Caml-list] posting policy and spam
From: Scott Alexander <salex@dsl.cis.upenn.edu>
To: caml-list@inria.fr
Cc: Richard Zidlicky <rz@linux-m68k.org>
In-Reply-To: <20040104102340.GA1795@linux-m68k.org>
References: <20040103102449.D31406@pauillac.inria.fr>
	 <20040103232837.GA20552@iliana>  <20040104102340.GA1795@linux-m68k.org>
Content-Type: text/plain
Message-Id: <1073233342.1808.191.camel@localhost.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) 
Date: 04 Jan 2004 11:22:22 -0500
Content-Transfer-Encoding: 7bit
X-Loop: caml-list@inria.fr
X-Spam: no; 0.00; caml-list:01 2004:99 2004:99 sven:01 luther:01 caml-list:01 spamoracle:01 spamoracle:01 bayesian:01 bayesian:01 fwiw:01 graham's:01 paulgraham:01 sofar:01 ham:99 
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

On Sun, 2004-01-04 at 05:23, Richard Zidlicky wrote:
> On Sun, Jan 04, 2004 at 12:28:37AM +0100, Sven Luther wrote:
> > On Sat, Jan 03, 2004 at 10:24:49AM +0100, Xavier Leroy wrote:
> > > There have been several complains recently about spam getting through
> > > the caml-list.
> > > 
> > > For your information, the list is filtered through SpamOracle, and the
> > > posting address receives several hundred spams a day.  Due to spammers
> > > getting more clever, the efficiency of the filtering went from perfect
> > > to about 99%.  That's enough to let significant amounts of spam slip
> > > through.
> > 
> > Well, on a similar subject, is there any chance of implementing a
> > workaround in spamoracle to counter those spams specifically designed to
> > fool the bayesian filters ? You know, those who have 4 lines of random
> > words in a text attachement, and then some html spam. 
> > 
> > I don't know if the bayesian filters or a modification thereof is able
> > to counter this kind of email, but i don't think so.
> 
> n-grams should be able to cope with the random words. There is already
> at least one library at sf implementing them so I am not sure it is
> worth to reimplement it in spamoracle.

FWIW, I've found the Bayesian stuff to do pretty well even with random
words given enough training.  (I'm using spambayes if it matters.)  Most
of the random words they pick aren't in my common words list as it turns
out.  And so many of the words in their actual message are in my spam
list.  (Obviously, this isn't a correct statement of how the algorithm
actually works, but I think it gives the right idea.)  After reading
Paul Graham's look back on Bayesian filtering after a year,
(http://www.paulgraham.com/sofar.html), I looked more closely at how
some of my spam and ham were scoring.  Looking at the misspelling
approach, I current score "viagra" as 0.974978, "vi@gra" as 0.844828,
and "v1@gra" as 0.908163.

As for random words, looking through my list of messages to be trained,
I have a typical spam titled "Re: YGOCP, to the procurator".  With a
long list of random words and breaking up their message ("<p>O</rigid>ur
U</immature>S Li</prominent>censed Doc</shepherd>tors wi</calve>ll<BR>
Prescr</violate>ibes Y</esophagi>our Me</antonym>dication
F</eigenvector>or F</irreversible>ree"), it scores as 99.79%.  Not only
do they have some URL elements (like biz) which are high on my spam
list, but some of the random words have become spam identifiers (euclid,
metalwork, adequacy, bourgeoisie, cornish, rectilinear).  It did hit a
few on the ham list (oregon, weird, and laminar appear in spam for the
first time with this message), but not enough to be significant.

I do train on (almost) every message that I receive and have done so for
several months.  According to the statistics section I have "Total
emails trained: Spam: 3893 Ham: 12685".  And I am having a false
positive problem with Caml-list after the rash of spams.  It seems to be
getting close to being trained back, but Caml-list is a relatively low
volume list for me.

Anyway, enough nattering on.  I'm amazed by the Bayesian stuff and find
it interesting.

Best,
Scott
-- 
Scott Alexander <salex@dsl.cis.upenn.edu>

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners