Gnus development mailing list
* spam-stat regeneration notes
@ 2003-06-14  0:43 Bill White
  2003-06-14 11:57 ` François Pinard
  2003-06-15 18:38 ` Ted Zlatanov
  0 siblings, 2 replies; 3+ messages in thread
From: Bill White @ 2003-06-14  0:43 UTC (permalink / raw)


I've been using spam-stat for about 6 months now, and noticed lately
that spam processing was getting mighty slow - thanks (I suspect) to
the hashbusters spammers are putting in their messages.  I even do
(spam-stat-reduce-size) when quitting gnus each day, so the thing was
as small as possible.

So today I did my first rebuild of the spam-stat database.  That's not
bad in my book - 6 months for one constantly-growing database.  Here's
the code, which I should probably put in a function "spam-reset" or
something.

----------------------------------------------------------------------
;; Reset:
(spam-stat-reset)

;; Learn spam:
(spam-stat-process-spam-directory "/billw/Mail-2003/spam")

;; Learn non-spam:
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/01")
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/02")
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/03")
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/04")
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/05")
(spam-stat-process-non-spam-directory "/billw/Mail-2003/mail/misc/2003/06")

;; Reduce table size:
(spam-stat-reduce-size)

;; Save table:
(spam-stat-save)
----------------------------------------------------------------------
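
A minimal sketch of the "spam-reset" wrapper mentioned above, assuming
the same spam-stat calls as in the snippet.  The names
`my-spam-stat-rebuild' and `my-monthly-dirs' are hypothetical, chosen
only for illustration; they are not part of spam-stat.

```elisp
(require 'spam-stat)

(defun my-monthly-dirs (base first-month last-month)
  "Return BASE/NN for each month from FIRST-MONTH to LAST-MONTH.
NN is the zero-padded month number.  Hypothetical helper."
  (mapcar (lambda (month) (format "%s/%02d" base month))
          (number-sequence first-month last-month)))

(defun my-spam-stat-rebuild (spam-dir ham-dirs)
  "Rebuild the spam-stat table from SPAM-DIR and the list HAM-DIRS.
Hypothetical wrapper around the steps from the snippet above:
reset, learn spam, learn non-spam, reduce, save."
  (spam-stat-reset)
  (spam-stat-process-spam-directory spam-dir)
  (dolist (dir ham-dirs)
    (spam-stat-process-non-spam-directory dir))
  (spam-stat-reduce-size)
  (spam-stat-save))

;; Example call, left commented out because it rewrites the saved table:
;; (my-spam-stat-rebuild
;;  "/billw/Mail-2003/spam"
;;  (my-monthly-dirs "/billw/Mail-2003/mail/misc/2003" 1 6))
```

Passing the directories as arguments keeps the monthly list out of the
function body, so next month's rebuild only changes the call.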

The results (the table file before and after the rebuild):

----------------------------------------------------------------------
-rw-rw-r--    1 billw    math      3066100 Jun 13 10:19 .spam-stat.el
-rw-rw-r--    1 billw    math       175109 Jun 13 19:22 .spam-stat.el
----------------------------------------------------------------------

Couple of questions:

- Has anyone else needed to regenerate a Bayesian hash system?

- Is there an easy way to run a function over an entire directory
  tree, while specifying which dirs to include or avoid?

Cheers -

bw
-- 
Bill White . billw@wolfram.com . http://members.wri.com/billw
"No ma'am, we're musicians."





* Re: spam-stat regeneration notes
From: François Pinard @ 2003-06-14 11:57 UTC (permalink / raw)
  Cc: ding

[Bill White]

> I've been using spam-stat for about 6 months now, and noticed lately
> that spam processing was getting mighty slow

My observation as well.  Things that grow without bound become slow
after a while: progressively, insidiously, and sometimes sooner than one
expects.  By then, the thing may have gotten fairly big, and rather
difficult to clean up or reconstruct properly.  At least, this
summarises _years_ of experience with BBDB. :-)

> - Has anyone else needed to regenerate a Bayesian hash system?

I kludged other methods so it is easy for me to do, and I did it once or
twice in the last few months.  Over time you get better at sorting ham and
spam properly (especially in borderline cases), and your training databases
also become more dependable if you "expire" them somehow.  As a consequence,
regenerating your Bayesian system makes it much better.  It does not take
many wrongly filed messages in the training databases to significantly
weaken a Bayesian system.

> - Is there an easy way to run a function over an entire directory
>   tree, while specifying which dirs to include or avoid?

GNU `find' maybe? :-)

In my own case, it did not take long to build the tool I needed, which is
pretty aware of all my little habits...  You might tackle this as well?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard




* Re: spam-stat regeneration notes
From: Ted Zlatanov @ 2003-06-15 18:38 UTC (permalink / raw)
  Cc: ding

On Fri, 13 Jun 2003, billw@wolfram.com wrote:
> I've been using spam-stat for about 6 months now, and noticed lately
> that spam processing was getting mighty slow - thanks (I suspect)
> to the hashbusters spammers are putting in their messages.  I even
> do (spam-stat-reduce-size) when quitting gnus each day, so the thing
> was as small as possible.

Maybe we should add code that filters out terms seen only once from
the spam/ham database.  Doing that once a week should work fine.
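
A rough sketch of that kind of filter, under a loud assumption: the
table here maps each term to a cons (NGOOD . NBAD) of ham and spam
counts.  That layout is made up for the example and is not claimed to
be spam-stat's own entry format; `my-prune-rare-terms' is likewise a
hypothetical name.

```elisp
(defun my-prune-rare-terms (table &optional min-count)
  "Remove from TABLE every term seen fewer than MIN-COUNT times in total.
TABLE maps a term to a cons (NGOOD . NBAD); this layout is an
assumption made for the sketch.  MIN-COUNT defaults to 2, which
drops terms seen only once.  Return the number of terms removed."
  (let ((min-count (or min-count 2))
        (doomed '()))
    ;; Collect the keys first, then delete, so the table is not
    ;; modified while `maphash' is still walking it.
    (maphash (lambda (term counts)
               (when (< (+ (car counts) (cdr counts)) min-count)
                 (push term doomed)))
             table)
    (dolist (term doomed)
      (remhash term table))
    (length doomed)))
```

A hashbuster token such as a random string would typically show up
once and be dropped, while terms that recur in either ham or spam stay.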

Also, spam-stat.el creates hashtables without any optimizations: 

(make-hash-table :test 'equal)

maybe that could be improved.  There may be many such optimization
points.  I don't use spam-stat personally, but maybe you can
instrument it and see where the slowdowns are.
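
One way to check whether table sizing matters at all is to time it.
The sketch below compares filling a default table against one created
with a :size hint, using `benchmark-run'.  The function names are
hypothetical, the keys are synthetic, and whether the hint helps
depends on the Emacs version, so treat the numbers as a measurement
to take rather than a result to expect.

```elisp
(require 'benchmark)

(defun my-fill-table (table n)
  "Insert N distinct string keys into TABLE and return TABLE."
  (dotimes (i n)
    (puthash (format "word-%d" i) i table))
  table)

(defun my-compare-table-sizing (n)
  "Return (DEFAULT-SECONDS . PRESIZED-SECONDS) for inserting N keys.
The first figure uses (make-hash-table :test \\='equal), the second
adds a :size hint of N."
  (cons (car (benchmark-run 1
               (my-fill-table (make-hash-table :test 'equal) n)))
        (car (benchmark-run 1
               (my-fill-table (make-hash-table :test 'equal :size n) n)))))

;; Example: (my-compare-table-sizing 200000)
```

For finding slow spots in spam-stat itself, the ELP profiler that
ships with Emacs (M-x elp-instrument-package, then M-x elp-results) is
probably the more direct tool.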

> So today I did my first rebuild of the spam-stat database.  That's
> not bad in my book - 6 months for one constantly-growing database.
> Here's the code, which I should probably put in a function
> "spam-reset" or something.

Please do!

> - Is there an easy way to run a function over an entire directory
>   tree, while specifying which dirs to include or avoid?

Maybe find-dired will help?  I don't know about anything like that
built-in...
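
Failing a built-in, a small walker is not much code.  This is only a
sketch: `my-map-directory-tree' is a hypothetical name, and the
skip-by-regexp rule is one possible way to express "which dirs to
avoid", not an established convention.

```elisp
(defun my-map-directory-tree (fn root &optional skip-regexp)
  "Call FN on ROOT and on every subdirectory beneath it.
A subdirectory whose own name (not its full path) matches
SKIP-REGEXP is skipped together with everything below it.
Symbolic links are not followed.  Return the list of directories
FN was called on, in the order they were visited."
  (let ((visited '())
        (pending (list (directory-file-name (expand-file-name root)))))
    (while pending
      (let ((dir (pop pending)))
        (funcall fn dir)
        (push dir visited)
        ;; Queue the subdirectories of DIR that pass the filter.
        (dolist (child (directory-files dir t))
          (let ((name (file-name-nondirectory child)))
            (when (and (not (member name '("." "..")))
                       (file-directory-p child)
                       (not (file-symlink-p child))
                       (not (and skip-regexp
                                 (string-match-p skip-regexp name))))
              (push child pending))))))
    (nreverse visited)))

;; Example, with a made-up directory name to avoid:
;; (my-map-directory-tree #'spam-stat-process-non-spam-directory
;;                        "/billw/Mail-2003/mail/misc"
;;                        "\\`drafts\\'")
```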

Ted


