Announcements and discussions for Gnus, the GNU Emacs Usenet newsreader
* Re: spam-stat.el and 'Better Bayesian Filtering'
From: Ted Zlatanov @ 2004-04-13 19:33 UTC (permalink / raw)


On Sat, 10 Apr 2004, knightsofspamalot-factotum@gvdnet.dk wrote:

> Well, for my part I decided to give Spambayes a spin instead, and
> it's working marvelously. My immediate practical problem has thus
> been solved.

That's wonderful.

> From what I can see at a glance, most of the work that would need to
> be done on spam-stat.el to bring it up to date with respect to
> 'Better Bayesian Filtering' would be in the tokeniser, since more
> complex tokens are used in said article. Also, it would probably be
> a good idea to make it sensitive to multipart messages (which, from
> what I can see, it is not; it merely looks at the first 4 kB) so
> that it'll not be thrown totally off course when a message starts
> with, say, an attached image. But it would be much more rewarding
> for anybody wishing to work on this to simply read Paul Graham's
> article rather than have me make a fool of myself trying to repeat
> what he says much more eruditely than I probably could. :-)

A URL would have been helpful:
http://www.paulgraham.com/better.html

There, we find the following tokenizing rules:

   1. Case is preserved.

   2. Exclamation points are constituent characters.

   3. Periods and commas are constituents if they occur between two
      digits. This lets me get ip addresses and prices intact.

   4. A price range like $20-25 yields two tokens, $20 and $25.

   5. Tokens that occur within the To, From, Subject, and Return-Path
      lines, or within urls, get marked accordingly. E.g. ``foo'' in
      the Subject line becomes ``Subject*foo''. (The asterisk could be
      any character you don't allow as a constituent.)
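
To make the five rules above concrete, here is a minimal sketch in Python. It is not ELisp and not the code in spam-stat.el; it only illustrates the rules as listed. Several things are assumptions of the sketch rather than anything stated above: the base set of constituent characters (letters, digits, dash, apostrophe, dollar sign), the function names tokenize_text and tokenize_message, the choice to tokenize other header lines unmarked, and the simplification of skipping URL marking and folded header lines.

```python
import re

# Assumed base constituents: letters, digits, "-", "'", "$", plus "!" (rule 2).
# "." and "," are constituents only between two digits (rule 3).
# Case is never folded (rule 1).
TOKEN_RE = re.compile(r"(?:[A-Za-z0-9$'!-]|(?<=[0-9])[.,](?=[0-9]))+")

# A whole token shaped like $20-25 is a price range (rule 4).
PRICE_RANGE_RE = re.compile(r"^\$([0-9]+(?:[.,][0-9]+)*)-([0-9]+(?:[.,][0-9]+)*)$")

# Header lines whose tokens get a "Name*" prefix (rule 5).
MARKED_HEADERS = ("To", "From", "Subject", "Return-Path")


def tokenize_text(text, prefix=""):
    """Split text into tokens, optionally marking each with a prefix."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        price = PRICE_RANGE_RE.match(tok)
        if price:
            # $20-25 becomes the two tokens $20 and $25.
            parts = ["$" + price.group(1), "$" + price.group(2)]
        else:
            parts = [tok]
        tokens.extend(prefix + part for part in parts)
    return tokens


def tokenize_message(raw):
    """Tokenize a raw message: marked headers get prefixes, the rest do not.

    Assumption: headers and body are separated by the first blank line,
    and header names are matched exactly as spelled in MARKED_HEADERS.
    """
    head, _, body = raw.partition("\n\n")
    tokens = []
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep and name in MARKED_HEADERS:
            tokens.extend(tokenize_text(value, prefix=name + "*"))
        elif sep:
            # Assumption: other headers are tokenized without a mark.
            tokens.extend(tokenize_text(value))
    tokens.extend(tokenize_text(body))
    return tokens


if __name__ == "__main__":
    print(tokenize_text("FREE!!! Only $19.99, was $20-25 at 192.168.0.1."))
    print(tokenize_message("Subject: Free money!\n\nBuy now"))
```

Note that the comma after $19.99 and the final period are dropped because neither sits between two digits, while the periods inside the price and the IP address are kept.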

Most of these changes would have to be made in
spam-stat-buffer-words.  I am not very familiar with ELisp syntax
tables, but it does seem that both spam-stat-syntax-table and the
function spam-stat-buffer-words itself need some work.

Ted

