caml-list - the Caml user's mailing list
* [Caml-list] Announcement: SpamOracle
@ 2002-08-26 13:11 Xavier Leroy
  2002-08-26 14:56 ` fred
  2002-10-20 10:43 ` Sven Luther
  0 siblings, 2 replies; 11+ messages in thread
From: Xavier Leroy @ 2002-08-26 13:11 UTC (permalink / raw)
  To: caml-list

[ Spam filtering is rather off-topic for this list, but this tool is
  written in OCaml, and needs adventurous users of the kind found on
  this list to be tested... ]

Are you tired of spam messages cluttering your e-mail?  Are you
retentive enough to have meticulously archived all this spam,
separately from your regular e-mail, in the hope that someday some
program might learn by example how to recognize spam -- something that
any human does in a fraction of a second?

If so, rejoice!  The time has come!

Introducing the Spam Oracle, a.k.a. "Saint Peter".  From the README:

  SpamOracle is a BiCapitalized tool to help detect and filter away "spam"
  (unsolicited commercial e-mail).  It proceeds by statistical analysis
  of the words that appear in the e-mail, comparing the frequencies of
  words with those found in a user-provided corpus of known spam and
  known legitimate e-mail.  The classification algorithm is based on
  Bayes' formula, and is described in Paul Graham's paper, "A plan for
  spam", http://www.paulgraham.com/spam.html.  
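
The word-frequency scoring described above can be sketched as follows. This is a minimal Python illustration of the formulas in Paul Graham's essay (doubled good counts, a minimum of 5 occurrences, 0.4 for unknown words, probabilities clamped to [0.01, 0.99], the 15 most significant words combined with Bayes' formula). It is not SpamOracle's actual code; the function names and the simplified tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def tokens(text):
    # Simplified tokenizer (an assumption): lowercase runs of letters,
    # digits, dollar signs, apostrophes and dashes.
    return re.findall(r"[A-Za-z0-9$'-]+", text.lower())

def train(messages):
    # Count how often each token occurs in a corpus of messages.
    counts = Counter()
    for message in messages:
        counts.update(tokens(message))
    return counts

def word_prob(word, good, bad, ngood, nbad):
    # Per-word spam probability as given in Graham's essay.
    g = 2 * good.get(word, 0)   # good counts are doubled to limit false positives
    b = bad.get(word, 0)
    if g + b < 5:
        return 0.4              # rare or unseen word: mildly "innocent"
    good_freq = min(1.0, g / ngood)
    bad_freq = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))

def spam_probability(text, good, bad, ngood, nbad, n=15):
    # Combine the n words whose probability is farthest from 0.5.
    probs = [word_prob(w, good, bad, ngood, nbad) for w in set(tokens(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam, ham = 1.0, 1.0
    for p in probs[:n]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)
```

Here `good` and `bad` are the token counts returned by `train` for the legitimate and spam corpora, and `ngood` and `nbad` are the number of messages in each corpus. Graham's essay treats a combined probability above 0.9 as spam; how SpamOracle maps scores to "yes", "no" and "unknown" is not stated in this excerpt.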

  This program is designed to work in conjunction with procmail.
  The result of the analysis is output as an additional message header
  "X-Spam:", followed by "yes", "no" or "unknown", plus additional
  details.  A procmail rule can then test this "X-Spam:" header and
  deliver the e-mail to the appropriate mailbox.

  In addition, SpamOracle also analyses MIME attachments,
  extracting relevant information such as MIME type, character encoding
  and attached file name, and summarizing them in an additional
  "X-Attachments:" header.  This allows procmail to easily reject
  e-mails containing suspicious attachments, e.g. Windows executables
  which often indicate a virus.
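
The header-based delivery described above is normally done with a procmail rule. As a rough illustration of the same decision, here is a minimal Python sketch that reads the two headers and picks a mailbox. The mailbox names, the `route` function, and the ".exe" substring test are assumptions for the example; the exact layout of the details after "yes"/"no"/"unknown" and of the "X-Attachments:" value is not given in this excerpt.

```python
from email import message_from_string

def route(raw_message):
    # Parse the message and read the headers that SpamOracle adds.
    msg = message_from_string(raw_message)
    verdict = msg.get("X-Spam", "").strip().lower()
    attachments = msg.get("X-Attachments", "").lower()

    if verdict.startswith("yes"):
        return "spam"            # hypothetical mailbox name
    if ".exe" in attachments:
        return "suspicious"      # hypothetical check for Windows executables
    return "inbox"               # "no", "unknown", or header missing
```

Messages tagged "unknown" fall through to the inbox here, which errs on the side of not losing legitimate mail.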

All for the incredibly low price of $0.00 !  But hurry -- it won't
last long!  Grab your copy from 
    http://pauillac.inria.fr/~xleroy/software.html#spamoracle

This AMAZING product, based on college-level statistical theory less
than one century old, will just REVOLUTIONIZE your life!  Or your
money back!  But don't take my word for it!  See the testimonials
from our happy customers!  Xavier L., from Versailles: "What with all
these e-mails about V**gra, p*nis enlargement, and hot teen*ge sl*ts,
I just couldn't concentrate on writing quality software and research
papers.  Your product just restored peace in my mailbox and in my
mind, too.  Now, I can spend whole days watching my procmail log
and bursting into hysterical laughter every time another spam bites
the dust!  Why, it feels so good, I think I'll subscribe to some of
these "one-time use" mailing lists JUST FOR FUN!"

Enjoy,

- Xavier Leroy
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] Announcement: SpamOracle
  2002-08-26 13:11 [Caml-list] Announcement: SpamOracle Xavier Leroy
@ 2002-08-26 14:56 ` fred
  2002-10-20 10:43 ` Sven Luther
  1 sibling, 0 replies; 11+ messages in thread
From: fred @ 2002-08-26 14:56 UTC (permalink / raw)
  Cc: caml-list

On Mon, Aug 26, 2002 at 03:11:38PM +0200, Xavier Leroy wrote:
> [ Spam filtering is rather off-topic for this list, but this tool is
>   written in OCaml, and needs adventurous users of the kind found on
>   this list to be tested... ]

Cool!  I've been messing with Eric Raymond's "bogofilter", the first
"Bayesian spam filter" (meme of the week, it seems) implementation
that I could find, but I'm looking forward to studying your
implementation as an example of a practical OCaml application.

Raymond is providing patches to mutt that make it easy (supposedly) to
add messages to the database of good/bad messages in the course of
normal mail reading activity.  The same hooks might be useful for
adding to an OCaml database instead.

-- 
Fred Yankowski      fred@ontosys.com           tel: +1.630.879.1312
OntoSys, Inc	    PGP keyID: 7B449345        fax: +1.630.879.1370
www.ontosys.com     38W242 Deerpath Rd, Batavia, IL 60510-9461, USA

* Re: [Caml-list] Announcement: SpamOracle
  2002-08-26 13:11 [Caml-list] Announcement: SpamOracle Xavier Leroy
  2002-08-26 14:56 ` fred
@ 2002-10-20 10:43 ` Sven Luther
  2002-10-20 20:49   ` Stefano Zacchiroli
  2002-10-21 12:46   ` Xavier Leroy
  1 sibling, 2 replies; 11+ messages in thread
From: Sven Luther @ 2002-10-20 10:43 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list

On Mon, Aug 26, 2002 at 03:11:38PM +0200, Xavier Leroy wrote:
> [ Spam filtering is rather off-topic for this list, but this tool is
>   written in OCaml, and needs adventurous users of the kind found on
>   this list to be tested... ]
> 
> Are you tired with spam messages cluttering your e-mail?  Are you
> retentive enough to have meticulously archived all this spam,
> separately from your regular e-mail, in the hope that someday some
> program might learn by example how to recognize spam -- something that
> any human does in a fraction of second?
> 
> If so, rejoice!  The time has come!

Xavier, ...

I am packaging spamoracle for Debian right now. It still lacks a manpage;
would you care to provide one, or should I do it?

That said, what I really wanted to know is whether you have some idea of how
spamoracle would scale under heavy load, for example if it is used to filter
mailing list input. Do you use it to filter the OCaml mailing lists or
something similar? And do you think it would be possible to filter the Debian
mailing lists without overloading the mailserver?

Friendly,

Sven Luther

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-20 10:43 ` Sven Luther
@ 2002-10-20 20:49   ` Stefano Zacchiroli
  2002-10-20 21:01     ` Jérôme Marant
  2002-10-21 12:46   ` Xavier Leroy
  1 sibling, 1 reply; 11+ messages in thread
From: Stefano Zacchiroli @ 2002-10-20 20:49 UTC (permalink / raw)
  To: Sven Luther; +Cc: Xavier Leroy, caml-list

On Sun, Oct 20, 2002 at 12:43:54PM +0200, Sven Luther wrote:
> That said, what i really wanted to know, is if you have some idea of how
> spamoracle would scale in case of heavy load, if you use it to filter
> mailing lists input for example ? For example, do you use it to filter
> the ocaml mailing lists or something such ? Or do you think it would be
> possible to filter the debian mailing lists and not have the mailserver
> overload or something such ?

BTW, have you performed any comparison with spamassassin?

Cheers.

-- 
Stefano Zacchiroli - undergraduate student of CS @ Univ. Bologna, Italy
zack@cs.unibo.it | ICQ# 33538863 | http://www.cs.unibo.it/~zacchiro
"I know you believe you understood what you think I said, but I am not
sure you realize that what you heard is not what I meant!" -- G.Romney

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-20 20:49   ` Stefano Zacchiroli
@ 2002-10-20 21:01     ` Jérôme Marant
  2002-10-21  9:37       ` Markus Mottl
  2002-10-21 11:51       ` Claude Marche
  0 siblings, 2 replies; 11+ messages in thread
From: Jérôme Marant @ 2002-10-20 21:01 UTC (permalink / raw)
  To: caml-list

Stefano Zacchiroli <zack@cs.unibo.it> writes:

> On Sun, Oct 20, 2002 at 12:43:54PM +0200, Sven Luther wrote:
>> That said, what i really wanted to know, is if you have some idea of how
>> spamoracle would scale in case of heavy load, if you use it to filter
>> mailing lists input for example ? For example, do you use it to filter
>> the ocaml mailing lists or something such ? Or do you think it would be
>> possible to filter the debian mailing lists and not have the mailserver
>> overload or something such ?
>
> BTW, have you performed any comparison with spamassassin?

Hi,

I've already tried spamoracle: I fed it about 2000 spams and
3000 good mails, and it too often classified good mail as spam.

Cheers,
 

-- 
Jérôme Marant

http://marant.org
              

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-20 21:01     ` Jérôme Marant
@ 2002-10-21  9:37       ` Markus Mottl
  2002-10-21 10:12         ` Jérôme Marant
  2002-10-21 11:51       ` Claude Marche
  1 sibling, 1 reply; 11+ messages in thread
From: Markus Mottl @ 2002-10-21  9:37 UTC (permalink / raw)
  To: Jérôme Marant; +Cc: caml-list

On Sun, 20 Oct 2002, Jérôme Marant wrote:
> I've already tried spamoracle: I fed it with about 2000 spams and 3000
> good mails and it too often considered good mail as spam.

To add my experience with spamoracle, I use it on a regular basis and am
very content with its performance. Of course, this is really a matter
of what kind of e-mail you usually get. It has happened only about
4 times so far (since end of August - I get about 30 mails per day)
that it misclassified admittedly "strange looking" e-mail (no contents,
only attachments). I have trained it using about 1000 spam and 10000
good mails.
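
To put those figures in perspective, the implied error rate can be estimated with a few lines of Python. This is a back-of-the-envelope sketch: taking "end of August" to mean the announcement date (2002-08-26) and a steady 30 mails per day are assumptions, so the result is only an order of magnitude.

```python
from datetime import date

def misclassification_rate(errors, mails_per_day, start, end):
    # Fraction of misclassified mails over the period, assuming a
    # constant daily volume.
    total = mails_per_day * (end - start).days
    return errors / total

rate = misclassification_rate(4, 30, date(2002, 8, 26), date(2002, 10, 21))
print(f"{rate:.2%}")  # prints 0.24%
```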

Though this is probably quite obvious anyway, I'd like to point out that
it is really important that all of the good and spam mails are the ones
that you have personally received. If you just take any kind of spam or
good mails, performance will definitely suffer.

If you absolutely don't want to miss good mails, you'll have to regularly
look at your spam folder. Even in this case spamoracle is very helpful,
because it decreases total entropy, i.e. makes it easier for you to
classify things with your own eyes.

Regards,
Markus Mottl

-- 
Markus Mottl                                             markus@oefai.at
Austrian Research Institute
for Artificial Intelligence                  http://www.oefai.at/~markus

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-21  9:37       ` Markus Mottl
@ 2002-10-21 10:12         ` Jérôme Marant
  0 siblings, 0 replies; 11+ messages in thread
From: Jérôme Marant @ 2002-10-21 10:12 UTC (permalink / raw)
  To: caml-list

Markus Mottl <markus@oefai.at> writes:

> Though this is probably quite obvious anyway, I'd like to point out that
> it is really important that all of the good and spam mails are the ones
> that you have personally received. If you just take any kind of spam or
> good mails, performance will definitely suffer.

  This is exactly what I did.

> If you absolutely don't want to miss good mails, you'll have to regularly
> look at your spam folder. Even in this case spamoracle is very helpful,

  Of course, I always have a look at the spam box so as not
  to delete good mails.

> because it decreases total entropy, i.e. makes it easier for you to
> classify things with your own eyes.

  I won't spend my time on such experiments anyway. Spamassassin works
  fine for me with no additional configuration.

  Cheers,

-- 
Jérôme Marant

http://marant.org
              

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-20 21:01     ` Jérôme Marant
  2002-10-21  9:37       ` Markus Mottl
@ 2002-10-21 11:51       ` Claude Marche
  2002-10-21 12:27         ` Jérôme Marant
  1 sibling, 1 reply; 11+ messages in thread
From: Claude Marche @ 2002-10-21 11:51 UTC (permalink / raw)
  To: Jérôme Marant; +Cc: caml-list


>>>>> "Jérôme" == Jérôme Marant <jmarant@nerim.net> writes:

    Jérôme> Stefano Zacchiroli <zack@cs.unibo.it> writes:
    >> On Sun, Oct 20, 2002 at 12:43:54PM +0200, Sven Luther wrote:
    >>> That said, what i really wanted to know, is if you have some idea of how
    >>> spamoracle would scale in case of heavy load, if you use it to filter
    >>> mailing lists input for example ? For example, do you use it to filter
    >>> the ocaml mailing lists or something such ? Or do you think it would be
    >>> possible to filter the debian mailing lists and not have the mailserver
    >>> overload or something such ?
    >> 
    >> BTW, have you performed any comparison with spamassassin?

    Jérôme> Hi,

    Jérôme> I've already tried spamoracle: I fed it with about 2000 spams and
    Jérôme> 3000 good mails and it too often considered good mail as spam.

Hi,

I have used SpamOracle almost since it was announced. Before that, I was
using SpamAssassin. Currently, my SpamOracle database contains roughly
20000 good mails and 1000 spams (not including Asian-language spams,
which are filtered differently).

Now, I usually get 0 or 1 unfiltered spam per day, usually because
they are written in French and my database is not large enough for
those. I check my spamoracle folder from time to time: I have had almost
no good mail classified as spam, and if I get one, I immediately move the
mail to a `good' folder and rebuild the database. I suggest you
check the way you built your database; maybe you made some mistakes.

Compared with SpamAssassin, SpamOracle runs much faster, which should
not surprise anyone here since SpamAssassin is a Perl
script. Moreover, I had problems with SpamAssassin because I receive
my mail on several machines that do not run the very same version of
Perl, which sometimes leads to runtime errors when executing
SpamAssassin.

Finally, one should be aware that the filtering methods of
SpamAssassin and SpamOracle are very different, and I very much like
the idea, in SpamOracle, that the filter should be tuned to the user's
personal idea of what spam is. I recommend reading Paul Graham's paper
(http://www.paulgraham.com/spam.html), on which SpamOracle's filtering
method is based.

I wish you a happy spam filtering !

- Claude


-- 
| Claude Marché           | mailto:Claude.Marche@lri.fr |
| LRI - Bât. 490          | http://www.lri.fr/~marche/  |
| Université de Paris-Sud | phoneto: +33 1 69 15 64 85  |
| F-91405 ORSAY Cedex     | faxto: +33 1 69 15 65 86    |

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-21 11:51       ` Claude Marche
@ 2002-10-21 12:27         ` Jérôme Marant
  0 siblings, 0 replies; 11+ messages in thread
From: Jérôme Marant @ 2002-10-21 12:27 UTC (permalink / raw)
  To: caml-list

Claude Marche <Claude.Marche@lri.fr> writes:

>     Jérôme> Hi,
>
>     Jérôme> I've already tried spamoracle: I fed it with about 2000 spams and
>     Jérôme> 3000 good mails and it too often considered good mail as spam.
>
> Hi,

Hi,

> I use Spamoracle almost since it has been announced. Before, I was
> using SpamAssassin. Currently, my Spamoracle database contains roughly
> 20000 good mails and 1000 spams (not including asiatic language spams
> which are filtered differently).
>
> Now, I usually get 0 or 1 spam per day not filtered, usually because
> there are written in french and my database is not large enough for
> those. I check my spamoracle folder some time to time, I had almost no
> good mail classified as spam, and if I get one, I immediately move the
> mail in a `good' folder and rebuild the database. I suggest you should
> check to way you built your database, may be you made some mistakes. 

Maybe.

> With respect to SpamAssassin, SpamOracle runs much faster, this would
> not surprise anyone here since SpamAssassin is a perl
> script. Moreover, I had problems with SpamAssassin because I receive

It depends how you use SpamAssassin :-) Using it as a daemon (spamd)
along with spamc is pretty fast.

> Finally, one should be aware that the filtering methods of
> SpamAssassin and SpamOracle are very different, and I like very much
> the idea, in SpamOracle, that the filter should be tuned by the user personal
> idea of what is a spam. I recommend reading Paul Graham's paper
> (http://www.paulgraham.com/spam.html) on which SpamOracle filter
> method is based.

Well, I know. The same algorithm is used in bogofilter.

> I wish you a happy spam filtering !

:-)

-- 
Jérôme Marant

http://marant.org
              

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-20 10:43 ` Sven Luther
  2002-10-20 20:49   ` Stefano Zacchiroli
@ 2002-10-21 12:46   ` Xavier Leroy
  2002-10-25  7:57     ` Michael Sperber [Mr.  Preprocessor]
  1 sibling, 1 reply; 11+ messages in thread
From: Xavier Leroy @ 2002-10-21 12:46 UTC (permalink / raw)
  To: Sven Luther; +Cc: caml-list

I suggest we move this discussion off the Caml list, as it's not
really relevant to Caml.  Briefly:

> That said, what i really wanted to know, is if you have some idea of how
> spamoracle would scale in case of heavy load, if you use it to filter
> mailing lists input for example ? For example, do you use it to filter
> the ocaml mailing lists or something such ?

We use SpamOracle to filter caml-bugs@inria.fr, with excellent
results.  caml-list@inria.fr is filtered by the "must be subscribed to
post" policy of Majordomo.

Generally speaking, SpamOracle runs faster than SpamAssassin, so if
your e-mail system withstands the latter, it should withstand the
former.  Of course, if your mailserver receives 10000 messages a day,
even SpamOracle could be computationally too expensive.

- Xavier Leroy

* Re: [Caml-list] Announcement: SpamOracle
  2002-10-21 12:46   ` Xavier Leroy
@ 2002-10-25  7:57     ` Michael Sperber [Mr.  Preprocessor]
  0 siblings, 0 replies; 11+ messages in thread
From: Michael Sperber [Mr.  Preprocessor] @ 2002-10-25  7:57 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: Sven Luther, caml-list

>>>>> "Xavier" == Xavier Leroy <xavier.leroy@inria.fr> writes:

Xavier> I suggest we move this discussion off the Caml list, as it's not
Xavier> really relevant to Caml.  Briefly:

>> That said, what i really wanted to know, is if you have some idea of how
>> spamoracle would scale in case of heavy load, if you use it to filter
>> mailing lists input for example ? For example, do you use it to filter
>> the ocaml mailing lists or something such ?

Xavier> We use SpamOracle to filter caml-bugs@inria.fr, with excellent
Xavier> results.  caml-list@inria.fr is filtered by the "must be subscribed to
Xavier> post" policy of Majordomo.

Xavier> Generally speaking, SpamOracle runs faster than SpamAssassin, so if
Xavier> your e-mail system withstand the latter, it should withstand the
Xavier> former.  Of course, if your mailserver receives 10000 messages a day,
Xavier> even SpamOracle could be computationally too expensive.

For reference, I run the byte-code version of SpamOracle on our IBM
320 mail servers.  I get on the order of 1000 messages/day which run
through SpamOracle.  The IBM 320 is a more-than-ten-year-old desktop AIX
machine, roughly comparable to a 486/33.  (And it handles the email of
the rest of the faculty as well, including several other SpamOracle
users.)  So any modern-hardware mailserver should be able to handle
10000/day easily.
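
The load figures above translate into a per-message time budget that is easy to compute. The Python sketch below assumes, for simplicity, that messages arrive evenly over the day and are filtered one at a time; real mail arrives in bursts, so this is an average rather than a guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def seconds_per_message(messages_per_day):
    # Average time available to filter each message before the next
    # one arrives, under an evenly spread load.
    return SECONDS_PER_DAY / messages_per_day

for load in (1000, 10000):
    print(load, "messages/day ->", seconds_per_message(load), "s per message")
```

At 10000 messages a day the filter has a little under nine seconds per message on average, compared with about a minute and a half at 1000 a day.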

I love SpamOracle.

BTW, the latest issue of the German "Spiegel" news magazine has an
article about Bayesian spam filters, quoting Paul Graham as saying
that the first one to offer a practical implementation of Bayes will
surely become very rich.  In my book, Xavier certainly deserves it.

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla

end of thread, other threads:[~2002-10-25  7:57 UTC | newest]

Thread overview: 11+ messages
2002-08-26 13:11 [Caml-list] Announcement: SpamOracle Xavier Leroy
2002-08-26 14:56 ` fred
2002-10-20 10:43 ` Sven Luther
2002-10-20 20:49   ` Stefano Zacchiroli
2002-10-20 21:01     ` Jérôme Marant
2002-10-21  9:37       ` Markus Mottl
2002-10-21 10:12         ` Jérôme Marant
2002-10-21 11:51       ` Claude Marche
2002-10-21 12:27         ` Jérôme Marant
2002-10-21 12:46   ` Xavier Leroy
2002-10-25  7:57     ` Michael Sperber [Mr.  Preprocessor]
