From: clemens fischer
Newsgroups: gmane.emacs.gnus.general
Subject: Re: Email filing.
Date: Thu, 05 Sep 2002 18:00:36 +0200
Message-ID: <8z2g8lzv.fsf@spotteswoode.dnsalias.org>
User-Agent: Gnus/5.090008 (Oort Gnus v0.08) Emacs/21.2 (i386--freebsd)

Scott A Crosby writes:

> Been done. Look at ifile. I've heard it's slow though.
>
> And that although it works, it doesn't always work as nicely as you'd
> like, in that it classifies right only about 80-90% of the time, but
> that 10% will annoy you. IMHO, it's only really useful when you have
> email that's uncategorizable by any other means.

i have much better numbers with ifile. where it fails, the failure can
be attributed to ifile not doing MIME, but /that would slow it!/. mr.
browne has proposed a workaround, though, which just identifies the
MIME parts and/or encodings. this would make ifile more accurate in
the typical text group, where it is intended to block spam. then
there's always the possibility of MIME-decoding messages before
classifying them.

my experiments show that ifile is good, in both accuracy and speed,
but i do use a tuned system with a procmail preprocessor. the
"recipes" don't do any classification; they throw out chinese and
spamtool-generated garbage. incidentally, procmail accounts for most
of the time needed to categorize my email.

> It could try to identify mailing lists by noting list-headers but I
> wouldn't want to bet on perfect reliability.

it is easy to support this: i have procmail tag messages sent to
mailing lists with a simple "X-Mailinglist: true" header early on, and
ifile adjusts nicely, including it in its statistics.
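
A rough sketch of the kind of preprocessing described above, written in Python rather than as the procmail recipes actually in use (those are not shown in this post). It tags list traffic with "X-Mailinglist: true" and extracts decoded text for a classifier that does not parse MIME. The function names `tag_mailing_list` and `decoded_text` and the set of headers checked are assumptions made for illustration only.

```python
from email import message_from_string, policy

# Headers that commonly mark mailing-list traffic.
# This set is an assumption, not taken from the original recipes.
LIST_HEADERS = ("List-Id", "List-Post", "List-Unsubscribe", "Mailing-List")


def tag_mailing_list(msg):
    """Add 'X-Mailinglist: true' when the message looks like list traffic."""
    is_list = any(name in msg for name in LIST_HEADERS)
    if msg.get("Precedence", "").strip().lower() == "list":
        is_list = True
    if is_list and "X-Mailinglist" not in msg:
        msg["X-Mailinglist"] = "true"
    return msg


def decoded_text(msg):
    """Return the text/plain parts with transfer encodings undone."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # get_content() reverses base64 / quoted-printable and
            # decodes the bytes using the part's declared charset.
            chunks.append(part.get_content())
    return "\n".join(chunks)
```

A caller would parse each incoming message with `message_from_string(raw, policy=policy.default)`, run it through `tag_mailing_list`, and hand either `msg.as_string()` or `decoded_text(msg)` to the classifier. Whether decoding helps accuracy enough to justify the extra time is exactly the trade-off discussed above.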

> For spam-checking, I'm doing an implementation of something that does
> naive Bayesian, but is flexible enough to be used for this. A *very
> fast* implementation.... (my benchmark right now for the statistics
> building is 5 seconds on a 35 MB, 7500-message corpus. V2 should be 30%
> faster.)

sounds very impressive. is it for spam/non-spam checking only?

-- 
clemens