From mboxrd@z Thu Jan 1 00:00:00 1970
From: Scott A Crosby
Newsgroups: gmane.emacs.gnus.general
Subject: Email filing.
Date: 19 Aug 2002 09:12:48 -0500
Organization: Rice University
Sender: owner-ding@hpc.uh.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Original-To: ding@gnus.org
User-Agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Common Lisp)
Xref: main.gmane.org gmane.emacs.gnus.general:46177

On Mon, 19 Aug 2002 01:44:05 -0400, prj@po.cwru.edu (Paul Jarc) writes:

> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) wrote:
> > There is a research field known as "information filtering" or
> > "(automatic) text classification" or "text categorization".  I don't
> > know the details of the theory, but folks in that community are
> > speaking of "naive Bayes classifiers" as one of the ways to do it --
> > maybe that's similar to his approach.
>
> Sounds like it.  Anyone know if this (or another) method generalizes
> to more than two categories (spam/nonspam)?  If so, it could be used

Yup.

> for all mail splitting.  We wouldn't have to manually craft split
> rules; we'd just seed a new group with the mails we have so far that

Been done. Look at ifile. I've heard it's slow, though, and that
although it works, it doesn't always work as nicely as you'd like: it
classifies correctly only about 80-90% of the time, and the 10-20% it
gets wrong will annoy you. IMHO, it's only really useful when you have
email that's uncategorizable by any other means. It could try to
identify mailing lists by noting list headers, but I wouldn't want to
bet on perfect reliability.

Personally, I can accept writing mailing list filters right now. Then
I sort out some emails by putting them into by-author folders, and
finally use nnmail-split-fancy-with-parent to put project emails in
the same folder as their parents automatically. [1] This is easy and
has perfect reliability.
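
To illustrate the "Yup" above: naive Bayes extends to any number of categories by keeping one set of word statistics per folder and picking the folder with the highest score. Below is a minimal Python sketch under stated assumptions; it is not ifile's code and not the implementation discussed in this thread. The class name `FolderClassifier`, the regex tokenizer, the add-one smoothing, and the toy training strings are all hypothetical; only the folder names are borrowed from the split rules quoted in this message.

```python
import math
import re
from collections import Counter, defaultdict


class FolderClassifier:
    """Multinomial naive Bayes over mail folders (hypothetical helper)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages seen
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphabetic runs are good enough as tokens.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, folder, text):
        words = self.tokenize(text)
        self.word_counts[folder].update(words)
        self.msg_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the folder with the highest log score, or None if untrained."""
        words = self.tokenize(text)
        total_msgs = sum(self.msg_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, n_msgs in self.msg_counts.items():
            counts = self.word_counts[folder]
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n_msgs / total_msgs)  # log prior of the folder
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


# Toy training data, invented purely for illustration.
clf = FolderClassifier()
clf.train("list.dev.gnus.ding", "gnus split rules nnmail emacs lisp")
clf.train("sender.family.mom", "dinner on sunday bring the kids")
clf.train("junk.spam", "cheap pills buy now limited offer")
print(clf.classify("how do I write nnmail split rules in gnus"))
```

In a real setup the training text would come from the messages already filed in each group, and the accuracy caveats mentioned above would still apply.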

However, if I had gobs of uncategorizable email (whether mailing list
or not), I might consider trying such a Bayesian classifier. [2]

For spam-checking, I'm doing an implementation of something that does
naive Bayesian classification, but is flexible enough to be used for
this. (A *very fast* implementation... my benchmark right now for the
statistics building is 5 seconds on a 35 MB, 7500-message corpus. V2
should be 30% faster.)

> belong there, and their contents would let the computer guess which
> new mails belong with them.

Scott

[2] It wouldn't be *too* hard to link in to something like Gnus.
Whenever a user manually classifies an email into a folder, tack on a
header 'manually moved here'. My core, when building statistics,
ignores all messages without that header. Yeah, it's a brute-force
scan, but it's fast. :) To classify, have a split rule pipe the email
to the core, which outputs the suggested category.

[1] Extracts from my .gnus:

Among the nice features: if I'm in a project/sender folder and send a
message, it gets stored there automatically (by gcc-self). All
non-mailing-list emails obey my-split-fancy-with-parent.

;; Where to store responses:
;;   I use GCC-self as a topic parameter in sender.*/project.*, which
;;   are not marked total-expire. For everything else, put it by-date.
(setq gnus-message-archive-group
      '(;; Sent list email messages to list email.
        ; ("blah.*" "nnml:store_folder")  ;Example for constant regexp matches
        (lambda (x)
          (cond
           ;; Store in existing group. (Done above with gcc-self)
           ; ((string-match "mail.*" group) (concat "nnml:" group))
           ; ((string-match "project.*" group) (concat "nnml:" group))
           ; ((string-match "sender.*" group) (concat "nnml:" group))
           ;; If no match, just dump into a per-month sent-mesg group.
           (t (concat "nnml:sent-mesg."
                      (format-time-string "%Y-%m")))))))

(defun my-split-fancy-with-parent ()
  "Do a split-with-parent, but ignore the result if it is a sent-mesg group.
This way followups land in their parent's group, but never in sent-mesg."
  (let ((my-split (nnmail-split-fancy-with-parent)))
    ;;(message "Doing-my-split-fancy-with-parent")
    ;;(message my-split)
    (if (or (null my-split)
            (string-match "sent-mesg" my-split))
        nil
      my-split)))

(setq nnmail-split-fancy
      '(| (| ;; Loop/Spam/special rules
           ("X-Spam-Status" "Yes" "junk.spam")
           ;; The dot must be escaped as \\. inside a Lisp string.
           ("Loop" "s?crosby@.*\\.edu" "inbox.loop")
           ("X-Delivery-Agent" "TMDA" "junk.tmda")
           ("Delivery-Agent" "TMDA" "junk.tmda")
           )
          (| ;; LIST RULES
           ("Sender" "spamassassin-talk-admin@lists.sourceforge.net" "list.dev.spamassassin.talk")
           ("Sender" "spamassassin-devel-admin@lists.sourceforge.net" "list.dev.spamassassin.dev")
           ("Delivered-To" "alias-ding@gnus.org" "list.dev.gnus.ding")
           )
          (: my-split-fancy-with-parent)
          (| ;; PROJECT RULES
           )
          (| ;; SENDER RULES
           ("from" "XXXXXXXXXX" "sender.family.mom")
           ("from" "YYYYY" "sender.FOOBAR")
           )
          (| ;; INBOX RULES
           ;; Dump duplicates into dups.
           ("Gnus-warning" "duplicat\\(e\\|ion\\) of message" "junk.duplicates")
           ("X-Spam-Level" "\\*\\*\\*\\*" "inbox.spam")
           ("Precedence" "bulk\\|list" "inbox.bulk")
           )
          "inbox.misc"
          )
      )

;; The form below doesn't work pre-Oort, but it gives you an idea of what
;; I wanted my group parameters to be.  (I do it now by setting topic
;; parameters.)
'(setq gnus-group-parameters
       '(
         (".*" (display . all))
         ;; Projects and by-sender boxes get responses stored in same.
         ;; Alternation in an Emacs regexp is \\|, not a bare |.
         ("^\\(project\\|sender\\)\\..*" (gcc-self . t))