Gnus development mailing list
From: Scott A Crosby <scrosby@cs.rice.edu>
Subject: Email filing.
Date: 19 Aug 2002 09:12:48 -0500
Message-ID: <oydznvjj5sv.fsf_-_@bert.cs.rice.edu>
In-Reply-To: <m3sn1bjtc4.fsf@multivac.cwru.edu>

On Mon, 19 Aug 2002 01:44:05 -0400, prj@po.cwru.edu (Paul Jarc) writes:

> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) wrote:
> > There is a research field known as "information filtering" or
> > "(automatic) text classification" or "text categorization".  I don't
> > know the details of the theory, but folks in that community are
> > speaking of "naive Bayes classifiers" as one of the ways to do it --
> > maybe that's similar to his approach.
> 
> Sounds like it.  Anyone know if this (or another) method generalizes
> to more than two categories (spam/nonspam)?  If so, it could be used

Yup.

> for all mail splitting.  We wouldn't have to manually craft split
> rules; we'd just seed a new group with the mails we have so far that

Been done. Look at ifile. I've heard it's slow, though.

I've also heard that although it works, it doesn't always work as
nicely as you'd like: it classifies correctly only about 80-90% of the
time, and the misfiled 10-20% will annoy you. IMHO, it's only really
useful when you have email that's uncategorizable by any other means.

It could try to identify mailing lists by noting list-headers but I
wouldn't want to bet on perfect reliability.
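
A minimal sketch of the list-header idea, written as a function for a
fancy split. It assumes the message carries an RFC 2919 List-Id header
and that the first label inside the angle brackets is a usable group
name. The function name and the "list.auto." prefix are made up for
illustration, and messages without the header fall through to the next
rule.

```elisp
(require 'message)

(defun my-split-by-list-id ()
  "Return a group name derived from the List-Id header, or nil.
Hypothetical helper: \"Foo <foo-devel.lists.example.org>\" becomes
\"list.auto.foo-devel\"."
  (let ((list-id (message-fetch-field "list-id")))
    (when (and list-id
               ;; Capture everything after \"<\" up to the first dot.
               (string-match "<\\([^.>]+\\)" list-id))
      (concat "list.auto." (downcase (match-string 1 list-id))))))
```

It would be called from nnmail-split-fancy as (: my-split-by-list-id),
after the hand-written list rules so that those still win.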

Personally, I can accept writing mailing list filters by hand right
now.  Then I sort out some emails by putting them into by-author
folders, and finally use nnmail-split-fancy-with-parent to put project
emails with their parents in the same folder automatically. [1]

This is easy and has perfect reliability. However, if I had gobs of
uncategorizable email (whether mailing list or not), I might consider
trying to use such a Bayesian classifier.[2]
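
To make the "more than two categories" point concrete, here is a
minimal sketch of a multi-folder naive Bayes classifier. It is not
ifile's code and not my core; the my-nb-* names, the whitespace-style
tokenizer, and the add-one smoothing are all assumptions chosen to
keep the example short.

```elisp
(require 'cl-lib)

(defun my-nb-tokens (text)
  "Split TEXT into lowercase alphanumeric tokens."
  (split-string (downcase text) "[^[:alnum:]]+" t))

(defun my-nb-train (samples)
  "Build a model from SAMPLES, a list of (FOLDER . TEXT) pairs.
The model maps each folder to (DOC-COUNT TOKEN-TOTAL TOKEN-COUNTS)."
  (let ((model (make-hash-table :test #'equal)))
    (dolist (sample samples)
      (let ((entry (or (gethash (car sample) model)
                       (puthash (car sample)
                                (list 0 0 (make-hash-table :test #'equal))
                                model))))
        (cl-incf (nth 0 entry))
        (dolist (tok (my-nb-tokens (cdr sample)))
          (cl-incf (nth 1 entry))
          (cl-incf (gethash tok (nth 2 entry) 0)))))
    model))

(defun my-nb-classify (model text)
  "Return the folder in MODEL with the highest log-probability for TEXT."
  (let ((tokens (my-nb-tokens text))
        (vocab (make-hash-table :test #'equal))
        (total-docs 0)
        best best-score)
    ;; First pass: count documents and collect the vocabulary.
    (maphash (lambda (_folder entry)
               (setq total-docs (+ total-docs (nth 0 entry)))
               (maphash (lambda (tok _n) (puthash tok t vocab))
                        (nth 2 entry)))
             model)
    ;; Second pass: log prior plus smoothed log likelihood per folder.
    (maphash
     (lambda (folder entry)
       (let ((score (log (/ (float (nth 0 entry)) total-docs)))
             (denom (float (+ (nth 1 entry) (hash-table-count vocab)))))
         (dolist (tok tokens)
           ;; Add-one (Laplace) smoothing so unseen tokens don't zero out.
           (setq score (+ score
                          (log (/ (+ 1.0 (gethash tok (nth 2 entry) 0))
                                  denom)))))
         (when (or (null best-score) (> score best-score))
           (setq best folder
                 best-score score))))
     model)
    best))
```

Nothing here is specific to spam/nonspam: each folder is just another
class. For example, (my-nb-classify (my-nb-train '(("list.gnus" .
"gnus split rules nnml") ("junk.spam" . "cheap pills buy now")))
"split rules in gnus") scores both folders and returns the better one.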

For spam-checking, I'm writing an implementation of a naive Bayesian
classifier that is flexible enough to be used for this, too. It is a
*very fast* implementation: my current benchmark for the statistics
building is 5 seconds on a 35 MB, 7500-message corpus. (V2 should be
30% faster.)

> belong there, and their contents would let the computer guess which
> new mails belong with them.

Scott



[2] It wouldn't be *too* hard to hook this into something like Gnus.
Whenever a user manually classifies an email into a folder, tack on a
header 'manually moved here'. My core, when building statistics,
ignores all messages without that header.  Yeah, it's a brute-force
scan, but it's fast. :) To classify, have a split rule pipe the email
to the core, which outputs the suggested category.
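
A rough sketch of what such a split rule could look like. The program
name "my-classifier" is hypothetical, as is its interface (one message
on stdin, one group name on stdout, exit status 0 on success); the
core's real interface isn't settled.

```elisp
(require 'subr-x)  ; `string-trim' on older Emacsen

(defvar my-classifier-program "my-classifier"
  "Hypothetical external classifier.
Assumed to read one message on stdin and print one group name.")

(defun my-split-by-classifier ()
  "Return the group suggested by `my-classifier-program', or nil."
  (when (executable-find my-classifier-program)
    (let ((out (generate-new-buffer " *my-classifier*")))
      (unwind-protect
          (save-restriction
            (widen)              ; send the body too, not just the headers
            (when (eq 0 (call-process-region (point-min) (point-max)
                                             my-classifier-program
                                             nil out nil))
              (let ((group (with-current-buffer out
                             (string-trim (buffer-string)))))
                ;; Empty output means "no opinion": fall through.
                (unless (string= group "") group))))
        (kill-buffer out)))))
```

In nnmail-split-fancy this would go in as (: my-split-by-classifier),
just before the final catch-all group, so that the hand-written rules
keep priority and the classifier only sees what they didn't claim.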

[1] Extracts from my .gnus:

Among the nice features: if I'm in a project/sender folder and send a
message, it gets stored there automatically (by gcc-self). All
non-mailing-list emails obey my-split-fancy-with-parent.



;; Where to store responses:
;;  I use GCC-self as a topic parameter in sender.*/project.*, which
;;  are not marked total-expire. For everything else, put it by-date.
(setq gnus-message-archive-group
      '(;; Send list email messages to list email.
;	("blah.*" "nnml:store_folder") ;Example for constant regexp matches
        (lambda (x) (cond 
		     ;; Store in existing group. (Done above with gcc-self)
;		      ((string-match "mail.*" group) (concat "nnml:" group))
;		      ((string-match "project.*" group) (concat "nnml:" group))
;		      ((string-match "sender.*" group) (concat "nnml:" group))
		      ;; If no match, just dump 
		      (t 
		       (concat "nnml:sent-mesg." (format-time-string "%Y-%m")))))))


(defun my-split-fancy-with-parent ()
  "Like `nnmail-split-fancy-with-parent', but never file into sent-mesg.
Followups go into the same group as their parent, unless that group
is a sent-mesg archive, in which case the result is ignored."

  (let ((my-split (nnmail-split-fancy-with-parent)))
    ;;(message "Doing-my-split-fancy-with-parent")
    ;;(message my-split)
    (if (or (null my-split) (string-match "sent-mesg" my-split))
	nil
      my-split)))



(setq nnmail-split-fancy
      '(|
        (| ;; Loop/Spam/special rules
         ("X-Spam-Status" "Yes" "junk.spam")
         ;; The dot must be escaped as \\. inside a Lisp string.
         ("Loop" "s?crosby@.*\\.edu" "inbox.loop")
         ("X-Delivery-Agent" "TMDA" "junk.tmda")
         ("Delivery-Agent" "TMDA" "junk.tmda")
         )
        (| ;; LIST RULES
         ("Sender" "spamassassin-talk-admin@lists.sourceforge.net" "list.dev.spamassassin.talk")
         ("Sender" "spamassassin-devel-admin@lists.sourceforge.net" "list.dev.spamassassin.dev")
         ("Delivered-To" "alias-ding@gnus.org" "list.dev.gnus.ding")
         )
        (: my-split-fancy-with-parent)
        (| ;; PROJECT RULES (omitted)
         )
        (| ;; SENDER RULES
         ("from" "XXXXXXXXXX" "sender.family.mom")
         ("from" "YYYYY" "sender.FOOBAR")
         )
        (| ;; INBOX RULES
         ;; Dump duplicates into dups.
         ("Gnus-warning" "duplicat\\(e\\|ion\\) of message" "junk.duplicates")
         ("X-Spam-Level" "\\*\\*\\*\\*" "inbox.spam")
         ("Precedence" "bulk\\|list" "inbox.bulk")
         )
        "inbox.misc"
        )
      )

##

The snippet below doesn't work before Oort Gnus, but it gives you an
idea of what I wanted my group parameters to be. (I do it now by
setting topic parameters.)


'(setq gnus-group-parameters
      '((".*"
         (display . all))
        ;; Projects and by-sender boxes get responses stored in same.
        ("^\\(project\\|sender\\)\\..*"
         (gcc-self . t))))



