Paul Graham on fighting SPAM

Gnus development mailing list

* Paul Graham on fighting SPAM
@ 2002-08-16 17:10 Danny Siu
  2002-08-17 19:43 ` Kai Großjohann
  2002-08-19  9:23 ` Paul Graham on fighting SPAM Alex Schroeder
  0 siblings, 2 replies; 35+ messages in thread
From: Danny Siu @ 2002-08-16 17:10 UTC (permalink / raw)



Since we have had much discussion of spam lately, it is worthwhile to read
what a Lisp guru thinks about how content-based filters can effectively kill spam.

<URL:http://www.paulgraham.com/spam.html>

-- 
Danny Siu





* Re: Paul Graham on fighting SPAM
  2002-08-16 17:10 Paul Graham on fighting SPAM Danny Siu
@ 2002-08-17 19:43 ` Kai Großjohann
  2002-08-19  5:44   ` Paul Jarc
  2002-08-19  9:23 ` Paul Graham on fighting SPAM Alex Schroeder
  1 sibling, 1 reply; 35+ messages in thread
From: Kai Großjohann @ 2002-08-17 19:43 UTC (permalink / raw)

Danny Siu <dsiu@adobe.com> writes:

> Since we have had much discussion of spam lately, it is worthwhile to
> read what a Lisp guru thinks about how content-based filters can
> effectively kill spam.

He has clearly seen the light :-)

There is a research field known as "information filtering" or
"(automatic) text classification" or "text categorization".  I don't
know the details of the theory, but folks in that community are
speaking of "naive Bayes classifiers" as one of the ways to do it --
maybe that's similar to his approach.  Other buzzwords that come to
my mind are kNN (k nearest neighbor) and support vector machines.
I'm not an expert in that field, but the numbers given by people who
talk about the effectiveness (quality) of text classifiers are quite
good; they are usually above 70%.  And that is on much harder
problems -- recognizing spam should be a no-brainer.

Maybe ShengHuo knows more and can elaborate.  I'm not an expert, just
aware that the field exists.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)


* Re: Paul Graham on fighting SPAM
  2002-08-17 19:43 ` Kai Großjohann
@ 2002-08-19  5:44   ` Paul Jarc
  2002-08-19  8:53     ` Kai Großjohann
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Paul Jarc @ 2002-08-19  5:44 UTC (permalink / raw)

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) wrote:
> There is a research field known as "information filtering" or
> "(automatic) text classification" or "text categorization".  I don't
> know the details of the theory, but folks in that community are
> speaking of "naive Bayes classifiers" as one of the ways to do it --
> maybe that's similar to his approach.

Sounds like it.  Anyone know if this (or another) method generalizes
to more than two categories (spam/nonspam)?  If so, it could be used
for all mail splitting.  We wouldn't have to manually craft split
rules; we'd just seed a new group with the mails we have so far that
belong there, and their contents would let the computer guess which
new mails belong with them.

paul


* Re: Paul Graham on fighting SPAM
  2002-08-19  5:44   ` Paul Jarc
@ 2002-08-19  8:53     ` Kai Großjohann
  2002-08-21  1:14       ` news
  2002-08-19 10:50     ` Oliver Scholz
  2002-08-19 14:12     ` Email filing Scott A Crosby
  2 siblings, 1 reply; 35+ messages in thread
From: Kai Großjohann @ 2002-08-19  8:53 UTC (permalink / raw)


prj@po.cwru.edu (Paul Jarc) writes:

> Sounds like it.  Anyone know if this (or another) method generalizes
> to more than two categories (spam/nonspam)?  If so, it could be used
> for all mail splitting.  We wouldn't have to manually craft split
> rules; we'd just seed a new group with the mails we have so far that
> belong there, and their contents would let the computer guess which
> new mails belong with them.

I think you can assume that text classifiers can choose one out of N
categories.  In some cases, this is implemented by having N yes/no
classifiers and choosing the result from the one with the highest
confidence value, but this is only an implementation detail.
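The "N yes/no classifiers, pick the most confident" idea described above can be sketched roughly as follows. This is only an illustration in Python, not code from any mail reader: the group names, the seed messages, and the crude word-overlap `confidence` score are all made up for the example, and a real classifier would use a proper statistical model.

```python
from collections import Counter

def train(groups):
    """Build one word-frequency table per group from its seed messages."""
    return {
        name: Counter(word for text in texts for word in text.lower().split())
        for name, texts in groups.items()
    }

def confidence(table, text):
    """Hypothetical yes/no score: fraction of the message's words known to this group."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in table) / len(words)

def classify(tables, text):
    """Ask every per-group classifier and keep the group with the highest confidence."""
    return max(tables, key=lambda name: confidence(tables[name], text))

# Made-up seed data: each group is "seeded" with the mails already filed there.
seed = {
    "list.gnus": ["gnus split rules for mail groups", "new gnus release announced"],
    "junk.spam": ["buy cheap pills now", "cheap offer click now"],
}
tables = train(seed)
print(classify(tables, "cheap pills offer"))
```

Seeding a new group then amounts to adding one more entry to `seed` and retraining, which is the workflow Paul Jarc asks about.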

Using automatic classification for splitting would really be way
cool.  Are there any mail readers which do this?

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)




* Re: Paul Graham on fighting SPAM
  2002-08-16 17:10 Paul Graham on fighting SPAM Danny Siu
  2002-08-17 19:43 ` Kai Großjohann
@ 2002-08-19  9:23 ` Alex Schroeder
  2002-08-19 11:29   ` Ted Zlatanov
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-19  9:23 UTC (permalink / raw)

Danny Siu <dsiu@adobe.com> writes:

> Since we have had much discussion of spam lately, it is worthwhile to read
> what a Lisp guru thinks about how content-based filters can effectively kill spam.
>
> <URL:http://www.paulgraham.com/spam.html>

I posted some code to g.e.sources to implement the basics.  If anybody
feels like fooling around with it, I'd be happy to read about it.
There's also a comment by Kai on it in g.e.help.

* http://www.emacswiki.org/cgi-bin/wiki.pl?SpamStat

Things I'd like to see: Efficient storage and retrieval of the data
from disk.  Based on 3351 mails, 298 of them being spam, I got a
dictionary of 650k; preparing it used an intermediary file of 7m.
Once saving is fast, I'd like to update the stats as we go along to
avoid the long preparation times.  Updating the stats requires the
original 7m of data, however.  So before delving into all of this, I'd
prefer to see whether it works, see what other people think, collect
some ideas and patches...

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-19  5:44   ` Paul Jarc
  2002-08-19  8:53     ` Kai Großjohann
@ 2002-08-19 10:50     ` Oliver Scholz
  2002-08-19 11:06       ` Kai Großjohann
  2002-08-19 14:12     ` Email filing Scott A Crosby
  2 siblings, 1 reply; 35+ messages in thread
From: Oliver Scholz @ 2002-08-19 10:50 UTC (permalink / raw)


prj@po.cwru.edu (Paul Jarc) writes:

> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) wrote:
>> There is a research field known as "information filtering" or
>> "(automatic) text classification" or "text categorization".  I don't
>> know the details of the theory, but folks in that community are
>> speaking of "naive Bayes classifiers" as one of the ways to do it --
>> maybe that's similar to his approach.
>
> Sounds like it.  Anyone know if this (or another) method generalizes
> to more than two categories (spam/nonspam)?  If so, it could be used
> for all mail splitting.  We wouldn't have to manually craft split
> rules; we'd just seed a new group with the mails we have so far that
> belong there, and their contents would let the computer guess which
> new mails belong with them.
[...]

Cool! I wonder if this technique could be abused to get a more
sophisticated adaptive scoring, too ...

    -- Oliver

-- 
2 Fructidor an 210 de la Révolution
Liberté, Egalité, Fraternité!






* Re: Paul Graham on fighting SPAM
  2002-08-19 10:50     ` Oliver Scholz
@ 2002-08-19 11:06       ` Kai Großjohann
  2002-08-19 14:55         ` Alex Schroeder
  0 siblings, 1 reply; 35+ messages in thread
From: Kai Großjohann @ 2002-08-19 11:06 UTC (permalink / raw)
  Cc: ding

Oliver Scholz <alkibiades@gmx.de> writes:

> Cool! I wonder if this technique could be abused to get a more
> sophisticated adaptive scoring, too ...

It's technically possible, I guess.  Only somebody needs to implement
it.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)




* Re: Paul Graham on fighting SPAM
  2002-08-19  9:23 ` Paul Graham on fighting SPAM Alex Schroeder
@ 2002-08-19 11:29   ` Ted Zlatanov
  2002-08-19 15:09     ` Alex Schroeder
  0 siblings, 1 reply; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-19 11:29 UTC (permalink / raw)
  Cc: ding

On Mon, 19 Aug 2002, alex@emacswiki.org wrote:
> I posted some code to g.e.sources to implement the basics.  If
> anybody feels like fooling around with it, I'd be happy to read
> about it.  There's also a comment by Kai on it in g.e.help.
> 
> * http://www.emacswiki.org/cgi-bin/wiki.pl?SpamStat
> 
> Things I'd like to see: Efficient storage and retrieval of the data
> from disk.  Based on 3351 mails, 298 of them being spam, I got a
> dictionary of 650k; preparing it used an intermediary file of 7m.
> Once saving is fast, I'd like to update the stats as we go along to
> avoid the long preparation times.  Updating the stats requires the
> original 7m of data, however.  So before delving into all of this,
> I'd prefer to see whether it works, see what other people think,
> collect some ideas and patches...

Do you want to integrate it with the current spam.el contents?  You
just need to add a function that uses spam-stat.el, that can be
invoked on a message buffer to return t or a number if spam is
detected, and nil otherwise.  See the spam-split function, it already
invokes the blackholes and whitelist/blacklist checks, and would
invoke your function as well.  I have to write the code to make those
checks user-selectable via some symbols, but that's a separate thing.
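The contract described above (each check returns a true value or a number when it detects spam and nothing otherwise, and the splitter runs the checks in turn) could look roughly like this. It is a Python sketch of the idea only, not the actual spam.el code: the function names, the example blacklist and word list, the threshold, and the "spam" return value are all assumptions made for illustration.

```python
def blacklist_check(message, blacklist=("spammer@example.com",)):
    """Return True when the sender is blacklisted, else None (like t / nil)."""
    return True if message.get("from") in blacklist else None

def stat_check(message, spam_words=("lottery", "winner"), threshold=0.5):
    """Return a numeric score when enough words look spammy, else None."""
    words = message.get("body", "").lower().split()
    if not words:
        return None
    score = sum(w in spam_words for w in words) / len(words)
    return score if score >= threshold else None

def spam_split(message, checks=(blacklist_check, stat_check)):
    """Run each check in order; the first non-None verdict sends the mail to 'spam'."""
    for check in checks:
        if check(message) is not None:
            return "spam"
    return None  # no check fired; let the remaining split rules decide

print(spam_split({"from": "friend@example.org", "body": "lottery winner"}))
```

Making the checks user-selectable, as mentioned above, would then come down to letting the user choose the contents of `checks`.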

-- 
Teodor Zlatanov <tzz@iglou.com>
"Brevis oratio penetrat colos, longa potatio evacuat ciphos." -Rabelais





* Email filing.
  2002-08-19  5:44   ` Paul Jarc
  2002-08-19  8:53     ` Kai Großjohann
  2002-08-19 10:50     ` Oliver Scholz
@ 2002-08-19 14:12     ` Scott A Crosby
  2002-09-05 16:00       ` clemens fischer
  2 siblings, 1 reply; 35+ messages in thread
From: Scott A Crosby @ 2002-08-19 14:12 UTC (permalink / raw)

On Mon, 19 Aug 2002 01:44:05 -0400, prj@po.cwru.edu (Paul Jarc) writes:

> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) wrote:
> > There is a research field known as "information filtering" or
> > "(automatic) text classification" or "text categorization".  I don't
> > know the details of the theory, but folks in that community are
> > speaking of "naive Bayes classifiers" as one of the ways to do it --
> > maybe that's similar to his approach.
> 
> Sounds like it.  Anyone know if this (or another) method generalizes
> to more than two categories (spam/nonspam)?  If so, it could be used

Yup.

> for all mail splitting.  We wouldn't have to manually craft split
> rules; we'd just seed a new group with the mails we have so far that

Been done. Look at ifile. I've heard it's slow, though.

And although it works, it doesn't always work as nicely as you'd
like: it classifies correctly only about 80-90% of the time, and
the misfiled 10-20% will annoy you. IMHO, it's only really useful when
you have email that can't be categorized by any other means.

It could try to identify mailing lists by noting list-headers but I
wouldn't want to bet on perfect reliability.

Personally, I can accept writing mailing list filters right now.  Then
I sort out some emails by putting them into folders by-author, and
finally use gnus-split-with-parent to put project emails with their
parents in the folder automatically. [1]

This is easy and has perfect reliability. However, if I had gobs of
uncategorizable email (whether mailing list or not), I might consider
trying to use such a Bayesian classifier.[2]

For spam-checking, I'm writing an implementation of something that does
naive Bayesian classification, but is flexible enough to be used for
this. A *very fast* implementation... (My benchmark right now for the
statistics building is 5 seconds on a 35 MB, 7500-message corpus. V2
should be 30% faster.)

> belong there, and their contents would let the computer guess which
> new mails belong with them.

Scott

[2] It wouldn't be *too* hard to link in to something like Gnus.
Whenever a user manually classifies an email into a folder, tack on a
header 'manually moved here'. My core, when building statistics,
ignores all messages without that header.  Yeah, it's a brute-scan, but
it's fast. :) To classify, have a split rule pipe the email to the core,
which outputs the suggested category.

[1] Extracts from my .gnus:

Among the nice features: if I'm in a project/sender folder and send a
message, it gets stored there automatically (by gcc-self). All
non-mailing-list emails obey my-split-fancy-with-parent.

;; Where to store responses:
;;  I use GCC-self as a topic parameter in sender.*/project.*, which
;;  are not marked total-expire. For everything else, put it by-date.
(setq gnus-message-archive-group
      '(;; Sent list email messages to list email.
;	("blah.*" "nnml:store_folder") ;Example for constant regexp matches
        (lambda (x) (cond 
		     ;; Store in existing group. (Done above with gcc-self)
;		      ((string-match "mail.*" group) (concat "nnml:" group))
;		      ((string-match "project.*" group) (concat "nnml:" group))
;		      ((string-match "sender.*" group) (concat "nnml:" group))
		      ;; If no match, just dump 
		      (t 
		       (concat "nnml:sent-mesg." (format-time-string "%Y-%m")))))))

(defun my-split-fancy-with-parent ()
  "Do a split-with-parent, however, ignore the result if it wants to put it in sent-mesg. This way, we put followups in the same group, however, we never put followups into sent-mesg."

  (let ((my-split (nnmail-split-fancy-with-parent)))
    ;;(message "Doing-my-split-fancy-with-parent")
    ;;(message my-split)
    (if (or (null my-split) (string-match "sent-mesg" my-split))
	nil
      my-split)))

(setq nnmail-split-fancy
      '(| 
	(| ;; Loop/Spam/special rules
	 ("X-Spam-Status" "Yes" "junk.spam")
	 ("Loop" "s?crosby@.*\\.edu" "inbox.loop")
	 ("X-Delivery-Agent" "TMDA" "junk.tmda")
	 ("Delivery-Agent" "TMDA" "junk.tmda")
	 )
	(| ;; LIST RULES
	 ("Sender" "spamassassin-talk-admin@lists.sourceforge.net" "list.dev.spamassassin.talk")
	 ("Sender" "spamassassin-devel-admin@lists.sourceforge.net" "list.dev.spamassassin.dev")
	 ("Delivered-To" "alias-ding@gnus.org" "list.dev.gnus.ding")
	 )
	(: my-split-fancy-with-parent)
	(| ;; PROJECT RULES (omitted from this extract)
	 )
	(| ;; SENDER RULES
	 ("from" "XXXXXXXXXX" "sender.family.mom")
	 ("from" "YYYYY" "sender.FOOBAR")
	 )
	(| ;; INBOX RULES
	 ;; Dump duplicates into dups. 	  
	 ("Gnus-warning" "duplicat\\(e\\|ion\\) of message" "junk.duplicates")
	 ("X-Spam-Level" "\\*\\*\\*\\*" "inbox.spam")
	 ("Precedence" "bulk\\|list" "inbox.bulk")
	 )
	"inbox.misc"
	)
      )

##

This below doesn't work pre-oorts, but it gives you an idea of what I wanted my group parameters to be. (I do it now by setting topic parameters.)

'(setq gnus-group-parameters
      '(
	(".*"
	 (display . all))
;; Projects and by-sender boxes get responses stored in same.
	("^\\(project\\|sender\\)\\..*"
         (gcc-self . t))))


* Re: Paul Graham on fighting SPAM
  2002-08-19 11:06       ` Kai Großjohann
@ 2002-08-19 14:55         ` Alex Schroeder
  2002-08-19 17:09           ` Kai Großjohann
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-19 14:55 UTC (permalink / raw)

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> Cool! I wonder if this technique could be abused to get a more
>> sophisticated adaptive scoring, too ...
>
> It's technically possible, I guess.  Only somebody needs to implement
> it.

I'd be interested in writing the elisp to do it -- but where do I get
the algorithms from?  Paul Graham's article had code ready to use on
his page.  Are there similar down-to-earth examples around?

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-19 11:29   ` Ted Zlatanov
@ 2002-08-19 15:09     ` Alex Schroeder
  2002-08-19 16:23       ` Ted Zlatanov
  2002-08-19 17:09       ` Kai Großjohann
  0 siblings, 2 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-19 15:09 UTC (permalink / raw)

Ted Zlatanov <tzz@lifelogs.com> writes:

> Do you want to integrate it with the current spam.el contents?  You
> just need to add a function that uses spam-stat.el, that can be
> invoked on a message buffer to return t or a number if spam is
> detected, and nil otherwise.  See the spam-split function, it already
> invokes the blackholes and whitelist/blacklist checks, and would
> invoke your function as well.  I have to write the code to make those
> checks user-selectable via some symbols, but that's a separate thing.

I would not mind adding the glue -- if people would use it.  Remember
that you have to create that dictionary, first!  So this is not
something that just works out of the box -- unless we distribute such
a dictionary from the web.  Maybe not a bad idea:

~% cat .spam-stat.el | gzip - | wc --bytes
 192390

Did you use spam-stat.el to create a dictionary for yourself to test
it with?  I'd be interested in hearing about it.

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-19 15:09     ` Alex Schroeder
@ 2002-08-19 16:23       ` Ted Zlatanov
  2002-08-19 22:22         ` Alex Schroeder
  2002-08-19 17:09       ` Kai Großjohann
  1 sibling, 1 reply; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-19 16:23 UTC (permalink / raw)
  Cc: ding

On Mon, 19 Aug 2002, alex@emacswiki.org wrote:
> Ted Zlatanov <tzz@lifelogs.com> writes:
> 
>> Do you want to integrate it with the current spam.el contents?  You
>> just need to add a function that uses spam-stat.el, that can be
>> invoked on a message buffer to return t or a number if spam is
>> detected, and nil otherwise.  See the spam-split function, it
>> already invokes the blackholes and whitelist/blacklist checks, and
>> would invoke your function as well.  I have to write the code to
>> make those checks user-selectable via some symbols, but that's a
>> separate thing.
> 
> I would not mind adding the glue -- if people would use it.
> Remember that you have to create that dictionary, first!  So this is
> not something that just works out of the box -- unless we distribute
> such a dictionary from the web.  Maybe not a bad idea:
> 
> ~% cat .spam-stat.el | gzip - | wc --bytes
>  192390
> 
> Did you use spam-stat.el to create a dictionary for yourself to test
> it with?  I'd be interested in hearing about it.

I haven't had the chance to use spam-stat.el myself (I'm job-hunting
at the moment).  The code looks good, though.

I would suggest that users can build a dictionary based on messages
they mark as spam.  Gnus supports a spam mark; another direction I
wanted to take spam.el besides splitting rules is to do summary exit
hooks where spam gets processed somehow (blacklisted, submitted to a
spam detection center, etc.)

It would seem that the ideal place to build the dictionary, therefore,
is on a summary exit hook.  The list of articles marked as spam is
available (gnus-summary-spam-marked) so all you need to give me is a
function that, given a message we know is spam, can increment the user
dictionary.  I'll write the function that applies that function to the
list of spam messages, unless you feel like doing that part too :)

Letting the user determine what's spam is probably the best strategy.
I imagine some corporate memos, for instance, could get classified as
spam with a default dictionary.

Thanks
Ted


* Re: Paul Graham on fighting SPAM
  2002-08-19 14:55         ` Alex Schroeder
@ 2002-08-19 17:09           ` Kai Großjohann
  0 siblings, 0 replies; 35+ messages in thread
From: Kai Großjohann @ 2002-08-19 17:09 UTC (permalink / raw)
  Cc: ding

Alex Schroeder <alex@emacswiki.org> writes:

> I'd be interested in writing the elisp to do it -- but where do I get
> the algorithms from?  Paul Graham's article had code ready to use on
> his page.  Are there similar down-to-earth examples around?

Not sure.  I only know about the research in this area.  Hm.  Ah:

http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html

There is a bibliography there.  Fabrizio has also written a nice
overview-like paper, but I forget where.  But from his homepage, you
get a multitude of papers.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)




* Re: Paul Graham on fighting SPAM
  2002-08-19 15:09     ` Alex Schroeder
  2002-08-19 16:23       ` Ted Zlatanov
@ 2002-08-19 17:09       ` Kai Großjohann
  2002-08-19 22:19         ` Alex Schroeder
  1 sibling, 1 reply; 35+ messages in thread
From: Kai Großjohann @ 2002-08-19 17:09 UTC (permalink / raw)
  Cc: ding

Alex Schroeder <alex@emacswiki.org> writes:

> I would not mind adding the glue -- if people would use it.  Remember
> that you have to create that dictionary, first!  So this is not
> something that just works out of the box -- unless we distribute such
> a dictionary from the web.

You can tell spam.el to create/update the dictionary whenever a
message is marked as spam.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)




* Re: Paul Graham on fighting SPAM
  2002-08-19 17:09       ` Kai Großjohann
@ 2002-08-19 22:19         ` Alex Schroeder
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-19 22:19 UTC (permalink / raw)

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Alex Schroeder <alex@emacswiki.org> writes:
>
>> I would not mind adding the glue -- if people would use it.  Remember
>> that you have to create that dictionary, first!  So this is not
>> something that just works out of the box -- unless we distribute such
>> a dictionary from the web.
>
> You can tell spam.el to create/update the dictionary whenever a
> message is marked as spam.

That's cool.  Ted Zlatanov also had interesting ideas.  My problem is
that the algorithm requires several megabytes of data structures.  I
think a lot of this can be simplified, so perhaps we can reduce this
by 50% or more.  I'll give it a try.

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-19 16:23       ` Ted Zlatanov
@ 2002-08-19 22:22         ` Alex Schroeder
  2002-08-20  7:42           ` Alex Schroeder
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-19 22:22 UTC (permalink / raw)


Ted Zlatanov <tzz@lifelogs.com> writes:

> It would seem that the ideal place to build the dictionary, therefore,
> is on a summary exit hook.  The list of articles marked as spam is
> available (gnus-summary-spam-marked) so all you need to give me is a
> function that, given a message we know is spam, can increment the user
> dictionary.  I'll write the function that applies that function to the
> list of spam messages, unless you feel like doing that part too :)

That sounds very good!  I didn't know about this spam flag (I didn't
read gnus.ding for several months).  That makes things a lot easier.
At the moment I am relying on the group name -- that sucks.  :)

Alex.




* Re: Paul Graham on fighting SPAM
  2002-08-19 22:22         ` Alex Schroeder
@ 2002-08-20  7:42           ` Alex Schroeder
  2002-08-20 12:00             ` Ted Zlatanov
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-20  7:42 UTC (permalink / raw)

Alex Schroeder <alex@emacswiki.org> writes:

> That sounds very good!  I didn't know about this spam flag (I didn't
> read gnus.ding for several months).  That makes things a lot easier.
> At the moment I am relying on the group name -- that sucks.  :)

If we do not move the mails away, however, we need to make sure that
every mail is processed only once.  Would it be possible to add an
additional header or flag when processing a mail?

The alternative to using a flag is one magic group such as mail.spam:
Process all new mail, move it to mail.spam.  If the user then moves
mail from a group to mail.spam, we can undo the "good" scores for that
mail and redo a "bad" score for it -- and the other way around.  The
current approach is very much like this (although still incomplete).
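The "undo the good scores, redo a bad score" bookkeeping described above might be sketched like this. It is a Python illustration under stated assumptions, not the spam-stat.el implementation: the `Stats` class, its method names, and the whitespace tokenizer are hypothetical, and it keeps plain word counts rather than the probabilities a real filter would derive from them.

```python
from collections import Counter

class Stats:
    """Hypothetical per-word counts for good and bad mail."""

    def __init__(self):
        self.good = Counter()
        self.bad = Counter()

    def add(self, text, spam):
        """Count every word of a newly processed mail in the matching table."""
        (self.bad if spam else self.good).update(text.lower().split())

    def remove(self, text, spam):
        """Undo an earlier add() for the same mail."""
        table = self.bad if spam else self.good
        table.subtract(text.lower().split())
        # Drop words whose count fell to zero so the table does not grow stale.
        for word in [w for w, n in table.items() if n <= 0]:
            del table[word]

    def change(self, text, to_spam):
        """The mail was moved between groups: undo the old verdict, record the new one."""
        self.remove(text, spam=not to_spam)
        self.add(text, spam=to_spam)

stats = Stats()
stats.add("free money now", spam=False)        # first filed as good mail
stats.change("free money now", to_spam=True)   # user moves it to mail.spam
print(stats.bad["money"], stats.good["money"])
```

Because only counts are stored, a reclassification needs just the text of the one mail that moved, not the full corpus.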

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-20  7:42           ` Alex Schroeder
@ 2002-08-20 12:00             ` Ted Zlatanov
  2002-08-22  2:21               ` Alex Schroeder
  2002-08-26 21:55               ` Alex Schroeder
  0 siblings, 2 replies; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-20 12:00 UTC (permalink / raw)
  Cc: ding

On Tue, 20 Aug 2002, alex@emacswiki.org wrote:
> Alex Schroeder <alex@emacswiki.org> writes:
> 
>> That sounds very good!  I didn't know about this spam flag (I
>> didn't read gnus.ding for several months).  That makes things a lot
>> easier.  At the moment I am relying on the group name -- that
>> sucks.  :)
> 
> If we do not move the mails away, however, we need to make sure that
> every mail is processed only once.  Would it be possible to add an
> additional header or flag when processing a mail?
> 
> The alternative to using a flag is one magic group such as
> mail.spam: Process all new mail, move it to mail.spam.  If the user
> then moves mail from a group to mail.spam, we can undo the "good"
> scores for that mail and redo a "bad" score for it -- and the other
> way around.  The current approach is very much like this (although
> still incomplete).

Why not just mark spam messages that have been processed as expired?
I doubt users will want to keep spam messages around :)

-- 
Teodor Zlatanov <tzz@iglou.com>
"Brevis oratio penetrat colos, longa potatio evacuat ciphos." -Rabelais





* Re: Paul Graham on fighting SPAM
  2002-08-19  8:53     ` Kai Großjohann
@ 2002-08-21  1:14       ` news
  2002-08-27 23:03         ` Nathan J. Williams
  0 siblings, 1 reply; 35+ messages in thread
From: news @ 2002-08-21  1:14 UTC (permalink / raw)


Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> prj@po.cwru.edu (Paul Jarc) writes:
>
>> Sounds like it.  Anyone know if this (or another) method generalizes
>> to more than two categories (spam/nonspam)?  If so, it could be used
>> for all mail splitting.  We wouldn't have to manually craft split
>> rules; we'd just seed a new group with the mails we have so far that
>> belong there, and their contents would let the computer guess which
>> new mails belong with them.
>
> I think you can assume that text classifiers can choose one out of N
> categories.  In some cases, this is implemented by having N yes/no
> classifiers and choosing the result from the one with the highest
> confidence value, but this is only an implementation detail.
>
> Using automatic classification for splitting would really be way
> cool.  Are there any mail readers which do this?

Mew does this with its mew-refile-guess-* functions, lists, and
variables.  It's not strictly automatic, but remembers and learns
from previous refiles (splitting, in gnusspeak.)

Chris




* Re: Paul Graham on fighting SPAM
  2002-08-20 12:00             ` Ted Zlatanov
@ 2002-08-22  2:21               ` Alex Schroeder
  2002-08-22 16:32                 ` Ted Zlatanov
  2002-08-26 21:55               ` Alex Schroeder
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-22  2:21 UTC (permalink / raw)

Ted Zlatanov <tzz@lifelogs.com> writes:

> Why not just mark spam messages that have been processed as expired?
> I doubt users will want to keep spam messages around :)

I still do not understand.  Here is my problem again:

When the user leaves the summary buffer, there are five possibilities:

1. The message was new and the user thinks it is non-spam.
   Call spam-stat-buffer-is-non-spam.
2. The message was new and the user thinks it is spam.
   Call spam-stat-buffer-is-spam.
3. The message was old and marked as spam but the user no longer
   thinks it is spam.  Call spam-stat-buffer-change-to-non-spam.
4. The message was old and marked as non-spam but the user now thinks
   it is spam.  Call spam-stat-buffer-change-to-spam.
5. The message is old and its spam-status unchanged, or unread.
   Nothing happens.

How to distinguish between these cases?
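The five cases listed above reduce to a small decision table once the previous verdict for a message is known. The Python sketch below assumes that verdict is recorded somewhere (which is exactly the open question in this message); the `action` helper and the "spam"/"ham" labels are hypothetical, while the returned strings are the function names from the list.

```python
def action(previous, current):
    """Pick the function to call on summary exit.

    previous: None for a new message, else "spam" or "ham" (the recorded verdict).
    current:  "spam" or "ham" as the user sees it now, or None if still unread.
    """
    if current is None or previous == current:
        return None  # case 5: unread or unchanged, nothing happens
    if previous is None:
        # cases 1 and 2: first verdict for a new message
        if current == "spam":
            return "spam-stat-buffer-is-spam"
        return "spam-stat-buffer-is-non-spam"
    # cases 3 and 4: the user changed an earlier verdict
    if current == "spam":
        return "spam-stat-buffer-change-to-spam"
    return "spam-stat-buffer-change-to-non-spam"

print(action(None, "spam"))
print(action("spam", "ham"))
```

The sketch makes the dependency explicit: without a stored `previous` value (a header, a mark, or a dedicated group), cases 1 and 3 cannot be told apart, and neither can cases 2 and 4.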

Note that I posted a new version to g.e.sources, which requires less
effort to build, can be updated on the fly, and has a tiny testing
infrastructure, plus test data by myself with a tiny set of mails.

Alex.


* Re: Paul Graham on fighting SPAM
  2002-08-22  2:21               ` Alex Schroeder
@ 2002-08-22 16:32                 ` Ted Zlatanov
  2002-08-22 16:57                   ` Ted Zlatanov
  0 siblings, 1 reply; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-22 16:32 UTC (permalink / raw)
  Cc: ding

On Thu, 22 Aug 2002, alex@emacswiki.org wrote:
> When the user leaves the summary buffer, there are five
> possibilities:
> 
> 1. The message was new and the user thinks it is non-spam.  Call
>    spam-stat-buffer-is-non-spam.  

Those articles do not have the spam-mark.  We should make it optional
for users to process all non-spam articles this way (since some
backends may be too slow).  Also, we need a way to mark that articles
have been run through the process already - I can't think of such a
way.

> 2. The message was new and the user thinks it is spam.  Call
> spam-stat-buffer-is-spam.  

Those articles have the spam-mark.  I think
gnus-spam-stat-process-buffer looks like the right function to call on
articles marked as spam; is there a corresponding function (or reuse
of this one) to process articles that are *not* spam?  How do you want
to handle the file setting for the word table - should it be a string
passed to a setup function on your side, or should we keep it in your
code and tell users to customize it?

Will your stats code be a part of Gnus?  Would you like to merge it
with spam.el, so all the spam code is in one place?

> 3. The message was old and marked as spam but the user no longer
> thinks it is spam.  Call spam-stat-buffer-change-to-non-spam.

> 4. The message was old and marked as non-spam but the user now
> thinks it is spam.  Call spam-stat-buffer-change-to-spam.  

> 5. The message is old and its spam-status unchanged, or unread.
> Nothing happens.

Messages should be marked as spam only once, and I think they should
be marked as expired once they are processed as spam at the summary
exit hook.  Are you suggesting that we leave the spam-mark on?  It's a
primary mark, so it can't coexist with the read or unread or expired
marks.  So I don't think there's a way currently to have a message be
both old and marked as spam.

We could make the spam mark a secondary mark, but the original idea
was to have spam marked once, processed, and expired.  I'd like to
know if anyone thinks the spam messages should be kept around in a way
that expiration (deletion or to a folder) can't handle.

> Note that I posted a new version to g.e.sources, which requires less
> effort to build, can be updated on the fly, and has a tiny testing
> infrastructure, plus test data by myself with a tiny set of mails.

I saw that, it looks great.  I'd love to hook or include it into
spam.el.

-- 
Teodor Zlatanov <tzz@iglou.com>
"Brevis oratio penetrat colos, longa potatio evacuat ciphos." -Rabelais


* Re: Paul Graham on fighting SPAM
  2002-08-22 16:32                 ` Ted Zlatanov
@ 2002-08-22 16:57                   ` Ted Zlatanov
  2002-08-22 17:57                     ` Kai Großjohann
  2002-08-22 20:07                     ` Alex Schroeder
  0 siblings, 2 replies; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-22 16:57 UTC (permalink / raw)
  Cc: ding

On Thu, 22 Aug 2002, tzz@lifelogs.com wrote:
> Those articles have the spam-mark.  I think
> gnus-spam-stat-process-buffer looks like the right function to call
> on articles marked as spam; is there a corresponding function (or
> reuse of this one) to process articles that are *not* spam?  How do
> you want to handle the file setting for the word table - should it
> be a string passed to a setup function on your side, or should we
> keep it in your code and tell users to customize it?

Whoops.  I was looking at an older version of your code by accident.
Your message makes a lot more sense with version 0.04 :)

The functions you have are great, we just need to figure out how to
preserve the state of an article marked as spam, for the
spam<->not_spam transitions, or to go with the original idea of
expiring all spam mail that's been processed already.

For new incoming spam, should we *also* run the spam analysis on a
buffer that we think is spam, or is the spam analysis only for buffers
that the user explicitly marks as spam?

-- 
Teodor Zlatanov <tzz@iglou.com>
"Brevis oratio penetrat colos, longa potatio evacuat ciphos." -Rabelais

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-22 16:57                   ` Ted Zlatanov
@ 2002-08-22 17:57                     ` Kai Großjohann
  2002-08-22 18:42                       ` Ted Zlatanov
  2002-08-22 19:59                       ` Alex Schroeder
  2002-08-22 20:07                     ` Alex Schroeder
  1 sibling, 2 replies; 35+ messages in thread
From: Kai Großjohann @ 2002-08-22 17:57 UTC (permalink / raw)
  Cc: ding

Ted Zlatanov <tzz@lifelogs.com> writes:

> For new incoming spam, should we *also* run the spam analysis on a
> buffer that we think is spam, or is the spam analysis only for buffers
> that the user explicitly marks as spam?

I started to compose an answer saying yes.  Then I thought about it
and deleted it and started to compose an answer saying no.  Then I
changed my mind again...  and then I figured that this is a
nontrivial question.

Maybe the folks over at comp.theory.info-retrieval know more about
this?  There should be some experts hanging out there.

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-22 17:57                     ` Kai Großjohann
@ 2002-08-22 18:42                       ` Ted Zlatanov
  2002-08-22 19:59                       ` Alex Schroeder
  1 sibling, 0 replies; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-22 18:42 UTC (permalink / raw)


On Thu, 22 Aug 2002, Kai.Grossjohann@CS.Uni-Dortmund.DE wrote:
> Ted Zlatanov <tzz@lifelogs.com> writes:
> 
>> For new incoming spam, should we *also* run the spam analysis on a
>> buffer that we think is spam, or is the spam analysis only for
>> buffers that the user explicitly marks as spam?
> 
> I started to compose an answer saying yes.  Then I thought about it
> and deleted it and started to compose an answer saying no.  Then I
> changed my mind again...  and then I figured that this is a
> nontrivial question.
> 
> Maybe the folks over at comp.theory.info-retrieval know more about
> this?  There should be some experts hanging out there.

I think we should make it an option, and let the user decide :)

(setq gnus-spam-stats-analyze-incoming t) ; analyze all spam, the default
(setq gnus-spam-stats-analyze-incoming 'some) ; analyze only spam submitted by user
(setq gnus-spam-stats-analyze-incoming nil) ; don't analyze spam

I think there are users that like each of the three approaches, so
trying to decide what's best for them is difficult.

So the stats analysis (Alex's work) will always be loaded in the
spam-split function in spam.el, but the
gnus-spam-stats-analyze-incoming variable will determine whether the
stats analysis will be used or not.  I think that makes sense.

I also think I'll have similar variables for ordb and blacklists:

(setq gnus-spam-check-ordb-incoming t) ; check ORDB
(setq gnus-spam-check-ordb-incoming nil) ; don't check ORDB, the default

(setq gnus-spam-check-blacklist-incoming t) ; check blacklist, the default
(setq gnus-spam-check-blacklist-incoming nil) ; don't check the blacklist

Would this be OK with everyone?  Or would people prefer the
symbol-based lists, something like

(add-to-list 'gnus-spam-split-incoming-checks 'ordb)
(add-to-list 'gnus-spam-split-incoming-checks 'blacklist)
(add-to-list 'gnus-spam-split-incoming-checks 'stats-analyze-all)
(add-to-list 'gnus-spam-split-incoming-checks 'stats-analyze-some)

It seems the latter approach is more flexible, but harder for novice
users to understand and implement.  I think the implementation will be
equally easy for both approaches.
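
For the symbol-based variant, a minimal sketch of how the checks
could be consulted follows.  Only gnus-spam-split-incoming-checks is
the name proposed above; my-spam-check-enabled-p is a made-up helper,
and this is an illustration, not code from spam.el.

```elisp
;; Hypothetical sketch: only the variable name comes from the proposal.
(defvar gnus-spam-split-incoming-checks nil
  "List of symbols naming the checks to run on incoming mail.")

(defun my-spam-check-enabled-p (check)
  "Return non-nil if CHECK is listed in `gnus-spam-split-incoming-checks'.
This helper is an assumption made for illustration."
  (memq check gnus-spam-split-incoming-checks))

;; A user enables checks by adding symbols to the list.
(add-to-list 'gnus-spam-split-incoming-checks 'ordb)
(add-to-list 'gnus-spam-split-incoming-checks 'blacklist)
```

The split function would then test (my-spam-check-enabled-p 'ordb)
and so on before running each check.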

-- 
Teodor Zlatanov <tzz@iglou.com>
"Brevis oratio penetrat colos, longa potatio evacuat ciphos." -Rabelais




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-22 17:57                     ` Kai Großjohann
  2002-08-22 18:42                       ` Ted Zlatanov
@ 2002-08-22 19:59                       ` Alex Schroeder
  1 sibling, 0 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-22 19:59 UTC (permalink / raw)

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Maybe the folks over at comp.theory.info-retrieval know more about
> this?  There should be some experts hanging out there.

Hehe, thought about this as well.  After all, the new message was
considered to be spam, so spam-stat might as well learn about the new
words in it.  If the user then changes his mind, I have functions to
deal with that.  What do you say?  After all, the message is *already*
in mail.spam, so to answer "yes" means to learn as we go, let the user
change the scores if he wants; to answer "no" means that we only learn
after the user has confirmed the decisions spam-stat made.

If we were talking about a neural network, then there'd be the
question of overtraining, going off on tangents, and no way to undo
it.  But all spam-stat knows is "how often was this word in the good
and the bad mails?" and "how many good and bad mails are there?" --
that is easy to undo.  Lucky us.  :)
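
To show why undoing is cheap, here is a minimal sketch under my own
assumptions.  It is not the spam-stat.el code; every my-stat-* name
is made up, and a mail is reduced to a list of its distinct words.

```elisp
;; Hypothetical sketch, not the spam-stat.el implementation.
(defvar my-stat-table (make-hash-table :test 'equal)
  "Maps a word to a cons (GOOD . BAD) of mail counts.")
(defvar my-stat-ngood 0 "Number of good mails learned.")
(defvar my-stat-nbad 0 "Number of bad (spam) mails learned.")

(defun my-stat-update (words spam delta)
  "Add DELTA to the counts for WORDS, the distinct words of one mail.
SPAM non-nil selects the bad counts, nil the good counts."
  (if spam
      (setq my-stat-nbad (+ my-stat-nbad delta))
    (setq my-stat-ngood (+ my-stat-ngood delta)))
  (dolist (word words)
    ;; Fetch the entry for WORD, creating a zeroed one if needed.
    (let ((entry (or (gethash word my-stat-table)
                     (puthash word (cons 0 0) my-stat-table))))
      (if spam
          (setcdr entry (+ (cdr entry) delta))
        (setcar entry (+ (car entry) delta))))))

(defun my-stat-learn (words spam)
  "Record one mail consisting of WORDS as spam or good."
  (my-stat-update words spam 1))

(defun my-stat-unlearn (words spam)
  "Undo an earlier `my-stat-learn' of the same WORDS."
  (my-stat-update words spam -1))

(defun my-stat-change-to-non-spam (words)
  "Move a mail consisting of WORDS from the spam to the good counts."
  (my-stat-unlearn words t)
  (my-stat-learn words nil))
```

Learning is just addition, so changing one's mind is subtraction on
the same counters followed by addition on the other side.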

Alex.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-22 16:57                   ` Ted Zlatanov
  2002-08-22 17:57                     ` Kai Großjohann
@ 2002-08-22 20:07                     ` Alex Schroeder
  2002-08-22 20:54                       ` Ted Zlatanov
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Schroeder @ 2002-08-22 20:07 UTC (permalink / raw)

Ted Zlatanov <tzz@lifelogs.com> writes:

> The functions you have are great, we just need to figure out how to
> preserve the state of an article marked as spam, for the
> spam<->not_spam transitions, or to go with the original idea of
> expiring all spam mail that's been processed already.

Maybe we could move spam into mail.spam, with expirable.  If the user
then moves mail from or to mail.spam, this indicates transitions and
must be handled immediately (instead of waiting for summary buffer
exit).  What do you think?  Is that possible to do in a
backend-independent way?

> Will your stats code be a part of Gnus?  Would you like to merge it
> with spam.el, so all the spam code is in one place?

I have no idea.  I would like to contribute it, once it is "done".  :)
That includes the FSF paperwork.  Once I think it is "done", I will
write to assign@gnu.org.  (Assuming that spam.el is part of Gnus which
is also part of Emacs.)

Alex.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-22 20:07                     ` Alex Schroeder
@ 2002-08-22 20:54                       ` Ted Zlatanov
  0 siblings, 0 replies; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-22 20:54 UTC (permalink / raw)
  Cc: ding

On Thu, 22 Aug 2002, alex@emacswiki.org wrote:
> Ted Zlatanov <tzz@lifelogs.com> writes:
> 
>> The functions you have are great, we just need to figure out how to
>> preserve the state of an article marked as spam, for the
>> spam<->not_spam transitions, or to go with the original idea of
>> expiring all spam mail that's been processed already.
> 
> Maybe we could move spam into mail.spam, with expirable.  If the
> user then moves mail from or to mail.spam, this indicates
> transitions and must be handled immediately (instead of waiting for
> summary buffer exit).  What do you think?  Is that possible to do in
> a backend-independent way?

I don't like a special newsgroup for this purpose.  For instance, NNTP
spam couldn't be treated in this way.  There was a past discussion
about when the spam analysis should be done, and the general agreement
was that spam should be blacklisted/submitted to ORDB/analyzed at the
time you exit the summary, not at the time you set the spam-mark on an
article.  The user could manually run the summary-exit-process-spam
(to pick a name) function that would normally be in the summary exit
hook, but the consensus was that the action should be delayed.

Also, splitting mail into a separate group is only one possible way to
treat spam.  The splitting function could just set the spam-mark on an
article that it processes, and let the other split rules handle where
the article should go.  I don't know if that's possible in a split
rule, though.  It may have to be done before splitting, at the time
new articles are obtained from the backend.

>> Will your stats code be a part of Gnus?  Would you like to merge it
>> with spam.el, so all the spam code is in one place?
> 
> I have no idea.  I would like to contribute it, once it is "done".
> :) That includes the FSF paperwork.  Once I think it is "done", I
> will write to assign@gnu.org.  (Assuming that spam.el is part of
> Gnus which is also part of Emacs.)

It's up to you how you want to contribute your code, as a separate
package or as a part of spam.el.  I have no problem with either
approach.  I think it's better to keep all spam-related code,
including yours, in one place (spam.el) to simplify documentation and
future extensions.  spam.el is part of Gnus (in the lisp subdirectory
in CVS and latest Oort).  So if you feel like looking at the current
spam.el and adding your source to it, I can commit that to CVS with
the proper attribution (or I can do the merge if you wish).  Of
course, feel free to continue a separate package if you prefer that.

Thanks
Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-20 12:00             ` Ted Zlatanov
  2002-08-22  2:21               ` Alex Schroeder
@ 2002-08-26 21:55               ` Alex Schroeder
  2002-08-26 23:19                 ` Alex Schroeder
                                   ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-26 21:55 UTC (permalink / raw)


Anyway, what shall we do with spam-stat.el, now?  An ifile user
suggested I write code to reduce the dictionary size again -- perhaps
I should remove all the words occurring less than 5 times, and all
words whose spaminess is close to 0.5 (common words occurring both in
spam and non-spam), and only the first few kb of all mails should be
analyzed.  Maybe I should write a sample usage that updates the
dictionary whenever you move mails into or out of the mail.spam group.
I know this is not what Teodor Zlatanov has in mind, but at least I
think I could do it myself.
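
The pruning part could look roughly like this.  It is a sketch under
assumptions: the my-reduce-* names, the 0.1 margin, and the
simplified spaminess formula are made up for illustration and are not
what spam-stat.el does.

```elisp
;; Hypothetical sketch of dictionary pruning; names are made up.
(defun my-reduce-spaminess (good bad ngood nbad)
  "Return a score from 0.0 (good) to 1.0 (spam) for one word.
The word was seen in GOOD of NGOOD good mails and BAD of NBAD spam
mails.  This is a simplified ratio of relative frequencies."
  (let ((g (/ (float good) (max 1 ngood)))
        (b (/ (float bad) (max 1 nbad))))
    (if (= 0.0 (+ g b))
        0.5
      (/ b (+ g b)))))

(defun my-reduce-table (table ngood nbad &optional min-count margin)
  "Remove rare and neutral words from TABLE; return how many went.
TABLE maps each word to a cons (GOOD . BAD).  A word is dropped when
it was seen fewer than MIN-COUNT times (default 5) or when its score
is within MARGIN (default 0.1) of 0.5."
  (let ((min-count (or min-count 5))
        (margin (or margin 0.1))
        (doomed nil))
    ;; Collect the keys first, then delete, so the table is not
    ;; modified while it is being traversed.
    (maphash (lambda (word entry)
               (let ((good (car entry))
                     (bad (cdr entry)))
                 (when (or (< (+ good bad) min-count)
                           (< (abs (- (my-reduce-spaminess
                                       good bad ngood nbad)
                                      0.5))
                              margin))
                   (push word doomed))))
             table)
    (dolist (word doomed)
      (remhash word table))
    (length doomed)))
```

Both thresholds are arguments here so that they can be tuned.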

Has anybody used spam-stat.el at all?  I must confess that I have not
yet added it to the fancy split rules myself, so perhaps we should
first start using it, before we start improving it.

Or should I just assign the copyright of spam-stat.el, then we can
move it into spam.el or whatever, and people will fix it as spam.el
gets wider usage?

Alex.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-26 21:55               ` Alex Schroeder
@ 2002-08-26 23:19                 ` Alex Schroeder
  2002-08-28  6:40                 ` Piers Cawley
  2002-08-29  2:46                 ` Ted Zlatanov
  2 siblings, 0 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-26 23:19 UTC (permalink / raw)


Alex Schroeder <alex@emacswiki.org> writes:

>   An ifile user suggested I write code to reduce the dictionary size
> again -- perhaps I should remove all the words occurring less than 5
> times, and all words whose spaminess is close to 0.5 (common words
> occurring both in spam and non-spam)

I was bored, so I implemented this.  A new version is in
gnu.emacs.sources.  It reduced the dictionary file from over 500k to
below 100k.  Sounds good!  :)

Alex.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-21  1:14       ` news
@ 2002-08-27 23:03         ` Nathan J. Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Nathan J. Williams @ 2002-08-27 23:03 UTC (permalink / raw)

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Using automatic classification for splitting would really be way
> cool.  Are there any mail readers which do this?

Jeremy Brown has beaten ifile into doing this with Gnus, and I've
recently adopted it. I trained it on my existing mail spool, and then
replaced 250 lines of nnmail-split-fancy rules with one ifile
invocation.

His information on using this can be found at:

http://www.ai.mit.edu/~jhbrown/ifile-gnus.html

It's all working with 5.8-series Gnus, so it doesn't do spam-marks, or
anything fancy and modern like that.

        - Nathan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-26 21:55               ` Alex Schroeder
  2002-08-26 23:19                 ` Alex Schroeder
@ 2002-08-28  6:40                 ` Piers Cawley
  2002-08-28 18:44                   ` Alex Schroeder
  2002-08-29  2:46                 ` Ted Zlatanov
  2 siblings, 1 reply; 35+ messages in thread
From: Piers Cawley @ 2002-08-28  6:40 UTC (permalink / raw)
  Cc: ding

Alex Schroeder <alex@emacswiki.org> writes:

> Anyway, what shall we do with spam-stat.el, now?  An ifile user
> suggested I write code to reduce the dictionary size again -- perhaps
> I should remove all the words occurring less than 5 times,

Then how do you incrementally change the dictionary? Where do you
magically remember that some word has been seen 4 times already and
should therefore go in the dictionary this time.

> and all words whose spaminess is close to 0.5 (common words occurring
> both in spam and non-spam), 

That sounds like a bad idea, maybe if a word hangs around the .5 mark
for a certain number of dictionary builds...

> and only the first few kb of all mails should be analyzed. 

Only if you can tune that.

-- 
Piers

   "It is a truth universally acknowledged that a language in
    possession of a rich syntax must be in need of a rewrite."
         -- Jane Austen?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-28  6:40                 ` Piers Cawley
@ 2002-08-28 18:44                   ` Alex Schroeder
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Schroeder @ 2002-08-28 18:44 UTC (permalink / raw)
  Cc: ding

Piers Cawley <pdcawley-ding@bofh.org.uk> writes:

> Then how do you incrementally change the dictionary? Where do you
> magically remember that some word has been seen 4 times already and
> should therefore go in the dictionary this time.
etc.

Exactly, reducing the size can only happen from time to time --
seldom, if at all.

>> and only the first few kb of all mails should be analyzed. 
>
> Only if you can tune that.

Yes, a lot of stuff is not in a defvar, yet; and the defvars need to
be changed to defcustoms.  This will change eventually.  :)

Alex.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Paul Graham on fighting SPAM
  2002-08-26 21:55               ` Alex Schroeder
  2002-08-26 23:19                 ` Alex Schroeder
  2002-08-28  6:40                 ` Piers Cawley
@ 2002-08-29  2:46                 ` Ted Zlatanov
  2 siblings, 0 replies; 35+ messages in thread
From: Ted Zlatanov @ 2002-08-29  2:46 UTC (permalink / raw)
  Cc: ding

On Mon, 26 Aug 2002, alex@emacswiki.org wrote:
> Has anybody used spam-stat.el at all?  I must confess that I have
> not yet added it to the fancy split rules myself, so perhaps we
> should first start using it, before we start improving it.

I have played with it, but have not had the time to do a lot.

> Anyway, what shall we do with spam-stat.el, now?  An ifile user
> suggested I write code to reduce the dictionary size again --
> perhaps I should remove all the words occurring less than 5 times,
> and all words whose spaminess is close to 0.5 (common words occurring
> both in spam and non-spam), and only the first few kb of all mails
> should be analyzed.  Maybe I should write a sample usage that
> updates the dictionary whenever you move mails into or out of the
> mail.spam group.  I know this is not what Teodor Zlatanov has in
> mind, but at least I think I could do it myself.

> Or should I just assign the copyright of spam-stat.el, then we can
> move it into spam.el or whatever, and people will fix it as spam.el
> gets wider usage?

It's entirely up to you, as the author of spam-stat.el.  I have no
problem with the package remaining separate from spam.el - I will
simply document the fact that users who want to use the statistics
feature can download your package and do a load-library before adding
its functions to the spam-split sequence or in the summary exit hooks.

Judging by your first paragraph, you'd rather continue your work
independently, and it would probably benefit everyone that you do so.
As long as you keep me in the loop (via gnu.emacs.sources postings or
e-mail) I will try to synchronize spam.el with changes that you make
to functionality or function names in spam-stat.el.

For those who are interested in using spam.el: I will follow up with a
separate summary to ding.

Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Email filing.
  2002-08-19 14:12     ` Email filing Scott A Crosby
@ 2002-09-05 16:00       ` clemens fischer
  2002-12-29 22:35         ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 35+ messages in thread
From: clemens fischer @ 2002-09-05 16:00 UTC (permalink / raw)

Scott A Crosby <scrosby@cs.rice.edu> writes:

> Been done. Look at ifile. I've heard it's slow though.
>
> And that although it works, it doesn't always work as nicely as you'd
> like, in that it classifies right only about 80-90% of the time, but
> that 10% will annoy you. IMHO, it's only really useful when you have
> email that's uncategorizable by any other means.

i have much better numbers with ifile.  where it fails it can be
attributed to not doing MIME, but /that would slow it!/.  mr. browne
has proposed a workaround, though, which just identifies the MIME
parts and/or encodings.  this would make ifile be more accurate in the
typical text-group, where it is intended to block spam.  then there's
always the possibility to mime-decode messages before classifying them.

my experiments show that ifile is good, both regarding accuracy and
speed, but i do use a tuned system with a procmail preprocessor.  the
"recipes" don't do any classification, they throw out chinese and
spamtool generated garbage.  incidentally, procmail uses most of the
time needed to categorize my email.

> It could try to identify mailing lists by noting list-headers but I
> wouldn't want to bet on perfect reliability.

it is easy to support this:  i have procmail tag messages to
mailinglists with a simple "X-Mailinglist: true" header early on, and
ifile adjusts nicely, including it in its statistics.

> For spam-checking, I'm doing an implementation of something that does
> naive Bayesian, but is flexible enough to be used for this. (A *very
> fast* implementation: my benchmark right now for the statistics
> building is 5 seconds on a 35 MB, 7500 message corpus. V2 should be 30%
> faster.)

sounds very impressive.  is it for spam/non-spam checking only?

-- 
clemens

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Email filing.
  2002-09-05 16:00       ` clemens fischer
@ 2002-12-29 22:35         ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 35+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-12-29 22:35 UTC (permalink / raw)


clemens fischer <ino-eb2dced1@spotteswoode.de.eu.org> writes:

> i have much better numbers with ifile.  where it fails it can be
> attributed to not doing MIME, but /that would slow it!/.  

ifile doesn't do MIME?  That's, er, strange for something that
handles mail.

Anyway, the gmime library does MIME stuff in a very speedy and
convenient manner.

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2002-12-29 22:35 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
2002-08-16 17:10 Paul Graham on fighting SPAM Danny Siu
2002-08-17 19:43 ` Kai Großjohann
2002-08-19  5:44   ` Paul Jarc
2002-08-19  8:53     ` Kai Großjohann
2002-08-21  1:14       ` news
2002-08-27 23:03         ` Nathan J. Williams
2002-08-19 10:50     ` Oliver Scholz
2002-08-19 11:06       ` Kai Großjohann
2002-08-19 14:55         ` Alex Schroeder
2002-08-19 17:09           ` Kai Großjohann
2002-08-19 14:12     ` Email filing Scott A Crosby
2002-09-05 16:00       ` clemens fischer
2002-12-29 22:35         ` Lars Magne Ingebrigtsen
2002-08-19  9:23 ` Paul Graham on fighting SPAM Alex Schroeder
2002-08-19 11:29   ` Ted Zlatanov
2002-08-19 15:09     ` Alex Schroeder
2002-08-19 16:23       ` Ted Zlatanov
2002-08-19 22:22         ` Alex Schroeder
2002-08-20  7:42           ` Alex Schroeder
2002-08-20 12:00             ` Ted Zlatanov
2002-08-22  2:21               ` Alex Schroeder
2002-08-22 16:32                 ` Ted Zlatanov
2002-08-22 16:57                   ` Ted Zlatanov
2002-08-22 17:57                     ` Kai Großjohann
2002-08-22 18:42                       ` Ted Zlatanov
2002-08-22 19:59                       ` Alex Schroeder
2002-08-22 20:07                     ` Alex Schroeder
2002-08-22 20:54                       ` Ted Zlatanov
2002-08-26 21:55               ` Alex Schroeder
2002-08-26 23:19                 ` Alex Schroeder
2002-08-28  6:40                 ` Piers Cawley
2002-08-28 18:44                   ` Alex Schroeder
2002-08-29  2:46                 ` Ted Zlatanov
2002-08-19 17:09       ` Kai Großjohann
2002-08-19 22:19         ` Alex Schroeder
