From: Alex Schroeder
Newsgroups: gmane.emacs.gnus.user
Subject: Re: spam-stat.el 0.0.4
Date: Thu, 22 Aug 2002 04:42:18 +0200
Message-ID: <87ptwb7gxh.fsf@emacswiki.org>
References: <873ct78wpr.fsf@emacswiki.org>

Alex Schroeder writes:

> ;; I used 323 mails in my spam directory, and 180 mails in my mail
> ;; directory.  SpamAssassin is installed on my server.
> ;; The result of my tests: Testing spam -- 293 / 324 = 0.904321
> ;; (10% of the spam missed), testing non-spam -- 6 / 180 = 0.033333
> ;; (3% false positives!).

I checked these 6 false positives (false positives would be a major
problem):

1. only one mpg file
2. bounced mail, includes three jpg files
3. only one mp3 file
4. this really is spam
5. this is really not spam at all -- a Portuguese mail
6. only one tar file

No. 5 can be explained by the very small number of Portuguese mails in
the sample.  Here is the debug output.  As you can see, the common
words "para" and "esta" account for most of the high scores.

spam-stat-score-data's value is shown below.

Documentation:
Raw data used in the last run of `spam-stat-score-buffer'.

Defined in `/home/alex/elisp/spam-stat.el'.

Value:
(("para" 0.99 0.49)
 ("para" 0.99 0.49)
 ("esta" 0.99 0.49)
 ("dia" 0.01 0.49)
 ("para" 0.99 0.49)
 ("Sascha" 0.01 0.49)
 ("schroeder" 0.01 0.49)
 ("Sascha" 0.01 0.49)
 ("esta" 0.99 0.49)
 ("para" 0.99 0.49)
 ("Hm" 0.01 0.49)
 ("que" 0.99 0.49)
 ("que" 0.99 0.49)
 ("Porto" 0.01 0.49)
 ("para" 0.99 0.49))

ATM I do not feel too bad about the false positives.  I hope that
they will disappear with a larger training set.
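
To see why the Portuguese mail ends up flagged, the per-word scores in the debug output above can be combined by hand. What follows is only a rough Python sketch, not code from spam-stat.el. It assumes the scores are combined with the Graham-style product formula, prod(p) / (prod(p) + prod(1 - p)), and that the third element of each triple is simply the distance of the score from 0.5. Both are assumptions, and the names `SCORE_DATA` and `combine` are made up for this illustration.

```python
import math

# Triples copied from the debug output above: (word, score, third value).
# Assumption: the third value is abs(score - 0.5); it is not used below.
SCORE_DATA = [
    ("para", 0.99, 0.49), ("para", 0.99, 0.49), ("esta", 0.99, 0.49),
    ("dia", 0.01, 0.49), ("para", 0.99, 0.49), ("Sascha", 0.01, 0.49),
    ("schroeder", 0.01, 0.49), ("Sascha", 0.01, 0.49), ("esta", 0.99, 0.49),
    ("para", 0.99, 0.49), ("Hm", 0.01, 0.49), ("que", 0.99, 0.49),
    ("que", 0.99, 0.49), ("Porto", 0.01, 0.49), ("para", 0.99, 0.49),
]


def combine(scores):
    """Graham-style combination of per-word spam scores (assumed formula)."""
    spam = math.prod(scores)                   # evidence for "spam"
    ham = math.prod(1.0 - s for s in scores)   # evidence for "not spam"
    return spam / (spam + ham)


scores = [score for _word, score, _diff in SCORE_DATA]
spammy = sum(1 for s in scores if s > 0.5)     # words scored 0.99
hammy = len(scores) - spammy                   # words scored 0.01
combined = combine(scores)

print(f"{spammy} spammy words, {hammy} non-spammy words, "
      f"combined score {combined:.6f}")
```

Under these assumptions, nine words at 0.99 stand against six at 0.01. Each 0.01 word cancels one 0.99 word, so three 0.99 words remain unopposed, and that alone pushes the combined score above 0.9999. A handful of common Portuguese words is therefore enough to outweigh the names that point the other way.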