From: jhbrown@ai.mit.edu (Jeremy H. Brown)
Newsgroups: gmane.emacs.gnus.general
Subject: Re: Using Eric Raymond's bogofilter tool within Gnus
Date: 03 Sep 2002 10:13:59 -0400
To: pinard@iro.umontreal.ca (François Pinard)
Cc: Matthias Andree, Forum of ding/Gnus users
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1

pinard@iro.umontreal.ca (François Pinard) writes:

> Let me thank you for the two references above.  Here are other
> references I have on Bayes filtering.

Let me second those thanks, and thank you as well!  With as many
projects as you've found, it'd be fun to have a running head-to-head
classifier competition.

>   . @ http://www.ai.mit.edu/~jrennie/ifile/
>   . @ http://www.ai.mit.edu/~jhbrown/ifile-gnus.html

I've been playing with ifile a great deal lately (thus leading to
ifile-gnus).  A product review:

The good: ifile is spectacularly accurate at classifying spam
vs. non-spam.  I love it.

You can also use ifile as a more general classifier so that it will
learn to file your mail into arbitrary groups; it is reasonably
accurate at that (about 85% according to a paper the ifile author
wrote, which jibes with my experience).

The bad: 85% accurate classification isn't good enough to use across
the board.  (Although it's a reasonable way of splitting mail that
your split-rules missed and would otherwise get dumped in your "misc"
group.)

Performance isn't stellar.  This is mostly because the database is
stored in a large flat text file.  With two classes (spam, non-spam)
it's usable with a DB around 200KB; with more classes, the database
rapidly heads towards a meg or so, and ifile becomes almost unusably
slow due to startup overhead.  (Caveat: I'm running it on a slow
computer, with the database stored on NFS, so I'm pretty much the
worst case imaginable.  I have specific performance numbers; if anyone
wants them, contact me personally.)

The ugly: Given a bunch of messages to classify or learn from all at
once, ifile parses them all into memory before moving on to the next
step; if your mailboxes are as out of control as mine, this will cause
ifile to run out of memory and lose.

I don't think there are any fundamental reasons that this couldn't be
fixed.  I have dreams of doing this, and of making ifile run as a
daemon to avoid the db-startup overhead.

I'd love to see more reviews of Bayesian (or other) spam filters; it'd
be trivial to mutate ifile-gnus to use just about any
command-line-drivable filter.

Jeremy
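
To illustrate the memory problem described under "The ugly", here is a
minimal Python sketch of the alternative: learning from messages one
at a time instead of parsing the whole batch into memory first.  It is
not ifile's code or database format.  The function names
(learn_streaming, classify), the tokenizer, and the add-one smoothing
with uniform class priors are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokens(text: str) -> Iterator[str]:
    """Yield tokens lazily from one message body."""
    return (m.group(0) for m in TOKEN.finditer(text.lower()))


def learn_streaming(messages: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Accumulate per-class word counts one message at a time.

    `messages` yields (class_name, message_text) pairs.  If it is a
    generator, only the current message is held in memory; each one is
    folded into the counts and then dropped.
    """
    counts: Dict[str, Counter] = {}
    for label, text in messages:
        counts.setdefault(label, Counter()).update(tokens(text))
    return counts


def classify(counts: Dict[str, Counter], text: str) -> Optional[str]:
    """Pick the class with the highest naive-Bayes log score.

    Assumes uniform class priors and add-one smoothing, both chosen
    only to keep the sketch short.
    """
    vocab_size = len(set().union(*counts.values()))
    words = list(tokens(text))
    best_label, best_score = None, -math.inf
    for label, class_counts in counts.items():
        total = sum(class_counts.values())
        score = sum(
            math.log((class_counts[w] + 1) / (total + vocab_size))
            for w in words
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def sample_messages() -> Iterator[Tuple[str, str]]:
    """Stand-in for reading a large mailbox lazily (made-up data)."""
    yield ("spam", "buy cheap pills now")
    yield ("non-spam", "meeting notes for the gnus release")


if __name__ == "__main__":
    model = learn_streaming(sample_messages())
    print(classify(model, "cheap pills"))
```

Peak memory here depends on the vocabulary size and the largest single
message rather than on the size of the mailbox.  The same loop shape
would work for classifying a batch: read one message, score it, emit
the result, move on.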