Subject: Re: [Caml-list] posting policy and spam
From: Scott Alexander
To: caml-list@inria.fr
Cc: Richard Zidlicky
Date: 04 Jan 2004 11:22:22 -0500

On Sun, 2004-01-04 at 05:23, Richard Zidlicky wrote:
> On Sun, Jan 04, 2004 at 12:28:37AM +0100, Sven Luther wrote:
> > On Sat, Jan 03, 2004 at 10:24:49AM +0100, Xavier Leroy wrote:
> > > There have been several complains recently about spam getting through
> > > the caml-list.
> > >
> > > For your information, the list is filtered through SpamOracle, and the
> > > posting address receives several hundred spams a day. Due to spammers
> > > getting more clever, the efficiency of the filtering went from perfect
> > > to about 99%. That's enough to let significant amounts of spam slip
> > > through.
> >
> > Well, on a similar subject, is there any chance of implementing a
> > workaround in spamoracle to counter those spams specifically designed to
> > fool the bayesian filters ? You know, those who have 4 lines of random
> > words in a text attachement, and then some html spam.
> >
> > I don't know if the bayesian filters or a modification thereof is able
> > to counter this kind of email, but i don't think so.
>
> n-grams should be able to cope with the random words. There is already
> at least one library at sf implementing them so I am not sure it is
> worth to reimplement it in spamoracle.

FWIW, I've found the Bayesian stuff to do pretty well even with random
words, given enough training. (I'm using spambayes, if it matters.) Most
of the random words they pick aren't in my common words list, as it
turns out, and many of the words in their actual message are in my spam
list. (Obviously, this isn't a correct statement of how the algorithm
actually works, but I think it gives the right idea.)

After reading Paul Graham's look back on Bayesian filtering after a year
(http://www.paulgraham.com/sofar.html), I looked more closely at how
some of my spam and ham were scoring. Looking at the misspelling
approach, I currently score "viagra" as 0.974978, "vi@gra" as 0.844828,
and "v1@gra" as 0.908163.

As for random words, looking through my list of messages to be trained,
I have a typical spam titled "Re: YGOCP, to the procurator". With a long
list of random words and breaking up their message ("

Our US Licensed Doctors will
Prescribes Your Medication For Free"), it scores as 99.79%. Not only do
they have some URL elements (like biz) which are high on my spam list,
but some of the random words have become spam identifiers (euclid,
metalwork, adequacy, bourgeoisie, cornish, rectilinear). It did hit a
few on the ham list (oregon, weird, and laminar appear in spam for the
first time with this message), but not enough to be significant.

I do train on (almost) every message that I receive and have done so for
several months. According to the statistics section I have "Total emails
trained: Spam: 3893 Ham: 12685". And I am having a false positive
problem with Caml-list after the rash of spams. It seems to be getting
close to being trained back, but Caml-list is a relatively low volume
list for me.

Anyway, enough nattering on. I'm amazed by the Bayesian stuff and find
it interesting.

Best,
Scott
--
Scott Alexander

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
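
For anyone curious how per-word scores like the ones above can turn into
a single message score, here is a minimal sketch in Python of the scheme
Paul Graham describes in "A Plan for Spam". It is an illustration only:
as far as I know spambayes combines token scores differently, and the
function names (token_spam_prob, combine) are made up for this example.
The constants (doubling ham counts, ignoring words seen fewer than 5
times, 0.4 for unknown words, clamping to 0.01..0.99, using the 15 most
extreme tokens) are taken from Graham's essay, not from my filter.

```python
from math import prod


def token_spam_prob(spam_count, ham_count, n_spam, n_ham):
    """Graham-style spam probability for one token.

    spam_count / ham_count: occurrences of the token in each corpus.
    n_spam / n_ham: number of trained messages in each corpus (assumed > 0).
    """
    ham_weighted = 2 * ham_count  # ham counts are doubled to reduce false positives
    if spam_count + ham_weighted < 5:
        return 0.4  # too rarely seen to say much; treated as mildly innocent
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, ham_weighted / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


def combine(probs, top_n=15):
    """Combine token probabilities (each strictly between 0 and 1).

    Only the top_n tokens farthest from the neutral 0.5 are used, so a
    block of unfamiliar random words contributes little next to a few
    strongly spammy tokens.
    """
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_like = prod(interesting)
    ham_like = prod(1.0 - p for p in interesting)
    return spam_like / (spam_like + ham_like)


if __name__ == "__main__":
    # Three strongly spammy tokens plus a run of never-seen random words (0.4 each).
    scores = [0.974978, 0.844828, 0.908163] + [0.4] * 10
    print(f"combined score: {combine(scores):.4f}")
```

The point of the sketch is the shape of the calculation: random filler
words that have never been trained on sit near the neutral score, so
they barely move the result, while a handful of tokens with extreme
scores dominate it.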