Gnus development mailing list
* spam-stat and base64 encoded messages
@ 2003-06-05  7:01 Oystein Viggen
  2003-06-05 20:05 ` Ted Zlatanov
  2003-06-06  1:59 ` Jesper Harder
  0 siblings, 2 replies; 13+ messages in thread
From: Oystein Viggen @ 2003-06-05  7:01 UTC (permalink / raw)


Hi

Lately, I've been getting a lot of spam where the message is hidden as a
base64 encoded text/plain or text/html part.  Since spam-stat works on
the raw message it doesn't see any spammy words, so it won't classify
the message as spam.

Would it be possible to add some hack to spam-stat for decoding these
text parts?
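
A small sketch of the effect described above (illustration only:
`my-word-visible-p' is a made-up helper, and spam-stat's real tokenizer
is more involved than a substring search).  It shows that a word is
not findable in the base64 form of a body but reappears after decoding:

```elisp
(defun my-word-visible-p (word text)
  "Return t if WORD occurs literally in TEXT, else nil.
Hypothetical helper for illustration; not part of spam-stat.el."
  (and (string-match-p (regexp-quote word) text) t))

(let* ((body "Buy cheap pills now")            ; made-up spam body
       (encoded (base64-encode-string body)))  ; what the raw message holds
  (list
   ;; A filter reading the raw message only sees the encoded text ...
   (my-word-visible-p "cheap" encoded)
   ;; ... and the word only shows up once the part is decoded.
   (my-word-visible-p "cheap" (base64-decode-string encoded))))
;; => (nil t)
```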

Øystein
-- 
This message was brought to you by the letter ß and the number e.




* Re: spam-stat and base64 encoded messages
  2003-06-05  7:01 spam-stat and base64 encoded messages Oystein Viggen
@ 2003-06-05 20:05 ` Ted Zlatanov
  2003-06-06  2:02   ` Jesper Harder
  2003-06-06  1:59 ` Jesper Harder
  1 sibling, 1 reply; 13+ messages in thread
From: Ted Zlatanov @ 2003-06-05 20:05 UTC (permalink / raw)
  Cc: ding

On Thu, 05 Jun 2003, oysteivi@tihlde.org wrote:
> Lately, I've been getting a lot of spam where the message is hidden
> as a base64 encoded text/plain or text/html part.  Since spam-stat
> works on the raw message it doesn't see any spammy words, so it
> won't classify the message as spam.
> 
> Would it be possible to add some hack to spam-stat for decoding
> these text parts?

Maybe the backend or Gnus should optionally decode it before spam-stat
ever sees the message in the splitting?  Right now it's not done for
performance.  I don't think spam-stat.el or spam.el should do what
logically is not their task.

Ted




* Re: spam-stat and base64 encoded messages
  2003-06-05  7:01 spam-stat and base64 encoded messages Oystein Viggen
  2003-06-05 20:05 ` Ted Zlatanov
@ 2003-06-06  1:59 ` Jesper Harder
  1 sibling, 0 replies; 13+ messages in thread
From: Jesper Harder @ 2003-06-06  1:59 UTC (permalink / raw)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Oystein Viggen <oysteivi@tihlde.org> writes:

> Lately, I've been getting a lot of spam where the message is hidden as a
> base64 encoded text/plain or text/html part.  Since spam-stat works on
> the raw message it doesn't see any spammy words, so it won't classify
> the message as spam.
>
> Would it be possible to add some hack to spam-stat for decoding these
> text parts?

Look at the recent subthread starting with
<news:m3y914urwq.fsf@defun.localdomain> in dk.edb.system.unix.  

The solution with inserting (run-hooks 'gnus-article-decode-hook)
doesn't have any bad side effects, AFAIK, but only deals with
text/plain.

The other solution is too hackish and probably needs some more work to
work properly in all circumstances.


> Øystein
> -- 
> This message was brought to you by the letter ß and the number e.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE+3/WbzEN/MEcathkRAgpxAJ9UmTnl67G+zeiRLWmIosJu3mfTwACeLAj7
exdYQqbfhh2IX3cekgC3U/o=
=PV6e
-----END PGP SIGNATURE-----




* Re: spam-stat and base64 encoded messages
  2003-06-05 20:05 ` Ted Zlatanov
@ 2003-06-06  2:02   ` Jesper Harder
  2003-06-06  3:22     ` Ted Zlatanov
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Harder @ 2003-06-06  2:02 UTC (permalink / raw)


Ted Zlatanov <tzz@lifelogs.com> writes:

> Maybe the backend or Gnus should optionally decode it before
> spam-stat ever sees the message in the splitting?  Right now it's
> not done for performance.  I don't think spam-stat.el or spam.el
> should do what logically is not their task.

At the moment spam-stat.el also reads directly from files, so decoding
by the back end wouldn't be enough.




* Re: spam-stat and base64 encoded messages
  2003-06-06  2:02   ` Jesper Harder
@ 2003-06-06  3:22     ` Ted Zlatanov
  2003-06-06 15:30       ` Jesper Harder
  0 siblings, 1 reply; 13+ messages in thread
From: Ted Zlatanov @ 2003-06-06  3:22 UTC (permalink / raw)


On Fri, 06 Jun 2003, harder@myrealbox.com wrote:
> Ted Zlatanov <tzz@lifelogs.com> writes:
> 
>> Maybe the backend or Gnus should optionally decode it before
>> spam-stat ever sees the message in the splitting?  Right now it's
>> not done for performance.  I don't think spam-stat.el or spam.el
>> should do what logically is not their task.
> 
> At the moment spam-stat.el also reads directly from files, so
> decoding by the back end wouldn't be enough.

You're right.  Assuming we don't care about the attachments as
entities, but only want to inline them in the message as plain text,
what Gnus functionality can I use to do this?

Thanks
Ted




* Re: spam-stat and base64 encoded messages
  2003-06-06  3:22     ` Ted Zlatanov
@ 2003-06-06 15:30       ` Jesper Harder
  2003-06-06 23:21         ` Oystein Viggen
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Harder @ 2003-06-06 15:30 UTC (permalink / raw)


Ted Zlatanov <tzz@lifelogs.com> writes:

> On Fri, 06 Jun 2003, harder@myrealbox.com wrote:
>> Ted Zlatanov <tzz@lifelogs.com> writes:
>> 
>>> Maybe the backend or Gnus should optionally decode it before
>>> spam-stat ever sees the message in the splitting?  Right now it's
>>> not done for performance.  I don't think spam-stat.el or spam.el
>>> should do what logically is not their task.
>> 
>> At the moment spam-stat.el also reads directly from files, so
>> decoding by the back end wouldn't be enough.
>
> You're right.  Assuming we don't care about the attachments as
> entities, but only want to inline them in the message as plain text,
> what Gnus functionality can I use to do this?

   (run-hooks 'gnus-article-decode-hook)

does part of the job.  Specifically it:

* decodes rfc2047-encoded headers.
* decodes single-part text/plain QP and Base64 encoded messages.

It's probably better than nothing, and as far as I can tell there are
no unintended side effects ... but a lot of spam is multipart/* and/or
text/html.
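
To make the kinds of decoding listed above concrete, here is a rough
sketch using the lower-level string functions from rfc2047.el and
qp.el plus the built-in base64 decoder.  It does not call the hook
itself, and the sample strings are made up:

```elisp
(require 'rfc2047)  ; RFC 2047 encoded-word decoding
(require 'qp)       ; quoted-printable decoding

;; An encoded-word header value as it appears in the raw message.
(rfc2047-decode-string "=?iso-8859-1?Q?=D8ystein?=")
;; => "Øystein"

;; A quoted-printable body fragment; "=" at end of line is a soft break.
(quoted-printable-decode-string "spam=\nmy words =3D visible")
;; => "spammy words = visible"

;; A base64 body fragment.
(base64-decode-string "QnV5IGNoZWFwIHBpbGxzIG5vdw==")
;; => "Buy cheap pills now"
```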

I don't think there's any existing functionality that does exactly
what we want.  `gnus-display-mime' is the closest, but it does far
too much.  

You can hack it a bit and wrap some `flet's and `let's around it to
make it sort of work, but it's not really the right way (at least
without some more work):

(require 'cl)

(defun my-decode (&optional ihandles)
  (interactive)
  (flet ((gnus-treat-article (&rest ignore)))
    (let ((gnus-summary-buffer (current-buffer))
	  (mm-text-html-renderer 'mm-inline-text))
      (save-excursion
	(let* ((handles (or ihandles
			    (mm-dissect-buffer nil gnus-article-loose-mime)
			    (and gnus-article-emulate-mime
				 (mm-uu-dissect))))
	       buffer-read-only handle name type b e display)
	  (when (and (not ihandles)
		     (not gnus-displaying-mime))
	    ;; Top-level call; we clean up.
	    (when gnus-article-mime-handles
	      (mm-destroy-parts gnus-article-mime-handles)
	      (setq gnus-article-mime-handle-alist nil));; A trick.
	    (setq gnus-article-mime-handles handles))
	  (if (and handles
		   (or (not (stringp (car handles)))
		       (cdr handles)))
	      (progn
		(when (and (not ihandles)
			   (not gnus-displaying-mime))
		  ;; Clean up for mime parts.
		  (article-goto-body)
		  (delete-region (point) (point-max)))
		(let ((gnus-displaying-mime t))
		  (gnus-mime-display-part handles)))))))))




* Re: spam-stat and base64 encoded messages
  2003-06-06 15:30       ` Jesper Harder
@ 2003-06-06 23:21         ` Oystein Viggen
  2003-06-09  1:21           ` Jesper Harder
  0 siblings, 1 reply; 13+ messages in thread
From: Oystein Viggen @ 2003-06-06 23:21 UTC (permalink / raw)


* [Jesper Harder] 

> Ted Zlatanov <tzz@lifelogs.com> writes:
>
>> You're right.  Assuming we don't care about the attachments as
>> entities, but only want to inline them in the message as plain text,
>> what Gnus functionality can I use to do this?

As for html, I think we might as well want to inline it in the message
as html code instead of rendering it to plain text.  The bayesian filter
might benefit from recognizing words like "href" as spammy.  (in short,
I'd like some code that identifies and decodes any base64 parts but does
nothing else to the buffer)

> I don't think there's any existing functionality that does exactly
> what we want.  `gnus-display-mime' is the closest, but it does far
> too much.  

I did some let'ing around gnus-display-mime, but wasn't able to get it
to work reliably.  As you say, the function does far too much.

I also did some experimenting with using article-de-base64-unreadable,
which seems the closest to what I wanted.  Spam-stat-test-directory with
a de-base64-hack added would recognize 1643 messages in my 1737 message
spam folder instead of 1622 without the patch.  This was with no
retraining of spam-stat, so any text previously hidden in base64 can be
considered new to spam-stat.  Not much of an improvement, but it's
measurable.  (and still no false positives in my trained ham folders)

After retraining with base64 decoding, the number of recognized spams in
the folder dropped to 1636.  I don't really know why.

A small patch:

Index: spam-stat.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/spam-stat.el,v
retrieving revision 6.12
diff -u -r6.12 spam-stat.el
--- spam-stat.el	1 May 2003 14:14:31 -0000	6.12
+++ spam-stat.el	6 Jun 2003 23:28:27 -0000
@@ -229,6 +229,9 @@
       (set-buffer (get-buffer-create spam-stat-buffer-name))
       (erase-buffer)
       (insert str)
+      (ignore-errors 
+	(let ((gnus-original-article-buffer (current-buffer)))
+	  (article-de-base64-unreadable)))
       (setq spam-stat-buffer (current-buffer)))))
 
 (defun spam-stat-store-gnus-article-buffer ()
@@ -509,6 +512,9 @@
 	  (setq count (1+ count))
 	  (message "Reading %s: %.2f%%" dir (/ count max))
 	  (insert-file-contents f)
+	  (ignore-errors 
+	    (let ((gnus-original-article-buffer (current-buffer)))
+	      (article-de-base64-unreadable)))
 	  (funcall func)
 	  (erase-buffer))))))
 
@@ -547,6 +553,9 @@
 	  (message "Reading %.2f%%, score %.2f%%"
 		   (/ count max) (/ score count))
 	  (insert-file-contents f)
+	  (ignore-errors 
+	    (let ((gnus-original-article-buffer (current-buffer)))
+	      (article-de-base64-unreadable)))
 	  (when (> (spam-stat-score-buffer) 0.9)
 	    (setq score (1+ score)))
 	  (erase-buffer))))


Ignore-errors is used to avoid the process choking on malformed base64
and quitting, which would be quite irritating.  There's probably a
better way to fix this -- Comments are very welcome  :)
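
One possible alternative to a bare `ignore-errors', sketched here as
an assumption rather than a tested change to spam-stat.el: catch the
error with `condition-case' so that a malformed part is skipped but
still gets logged.  `my-safe-base64-decode' is a hypothetical helper
name:

```elisp
(defun my-safe-base64-decode (string)
  "Decode base64 STRING, returning nil instead of signalling an error.
Hypothetical helper, not part of spam-stat.el.  Unlike `ignore-errors',
this logs why decoding failed."
  (condition-case err
      (base64-decode-string string)
    (error
     (message "Skipping undecodable base64: %s" (error-message-string err))
     nil)))

(my-safe-base64-decode "QnV5")  ; => "Buy"
(my-safe-base64-decode "!!!")   ; => nil, with a message in the echo area
```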

> You can hack it a bit and wrap some `flet's and `let's around it to
> make it sort of work, but it's not really the right way (at least
> without some more work):
>
> (require 'cl)
>
> (defun my-decode (&optional ihandles)

Haven't looked at your-decode yet.  I'll check it out later when I have
time.  (hopefully, someone who knows gnus and lisp will beat me to it :)

Øystein
-- 
If it ain't broke, don't break it.




* Re: spam-stat and base64 encoded messages
  2003-06-06 23:21         ` Oystein Viggen
@ 2003-06-09  1:21           ` Jesper Harder
  2003-06-09 20:06             ` Ted Zlatanov
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Harder @ 2003-06-09  1:21 UTC (permalink / raw)


Oystein Viggen <oysteivi@tihlde.org> writes:

> I also did some experimenting with using
> article-de-base64-unreadable, which seems the closest to what I
> wanted.

`article-de-base64-unreadable' is too low-level.  It works on
single-part messages, but not on multipart/* -- which seems to be the
way most of my spam is sent.

I think that to make this work correctly, you'll need to parse the
MIME structure of the message, and then apply the proper decoding to
the appropriate parts.




* Re: spam-stat and base64 encoded messages
  2003-06-09  1:21           ` Jesper Harder
@ 2003-06-09 20:06             ` Ted Zlatanov
  2003-06-11 19:42               ` Jesper Harder
  0 siblings, 1 reply; 13+ messages in thread
From: Ted Zlatanov @ 2003-06-09 20:06 UTC (permalink / raw)
  Cc: John Owens

On Mon, 09 Jun 2003, harder@myrealbox.com wrote:
> I think that to make this work correctly, you'll need to parse the
> MIME structure of the message, and then apply the proper decoding to
> the appropriate parts.

Hmm, are you sure we need to do full MIME parsing?  That would slow
down the incoming mail splitting a lot, I would think.  But I don't
know all the Gnus MIME parsing functionality, or how fast it is.  See
below for more questions.

John Owens (cc-ed on this) was asking about forwarded spam messages,
which are inside an envelope from SpamAssassin.  That's another case
where spam-split or spam-stat-split has to do a lot of parsing.  Maybe
there's a better way?  

We can invoke spam-split or spam-stat-split on each part of the
messages, then if they return t we know it's ham; if they return a
string it's spam, and nil means the part was neither.  In other words,
we don't care about the deep structure, for instance one attachment
inside another.  We just want to find MIME boundaries, take the text
up to the next MIME boundary (even if it includes other MIME
boundaries), decode if needed (no decoding should be done on plain
text!), and analyze the part.  Is that possible already?
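
The per-part idea above could be sketched roughly like this.
Everything here is hypothetical: `my-combine-part-verdicts' is a
made-up name, the classifier is passed in as a function rather than
being spam-split itself, and the policy (any spam part wins over ham)
is one assumption among several possible ones:

```elisp
(defun my-combine-part-verdicts (parts classify)
  "Apply CLASSIFY to each string in PARTS and combine the verdicts.
CLASSIFY returns a group name (a string) for spam, t for ham, and nil
when it cannot tell.  Return the first spam group found, else t if
any part was ham, else nil."
  (let (ham spam)
    (dolist (part parts)
      (let ((verdict (funcall classify part)))
        (cond ((stringp verdict)          ; spam: remember the first group
               (unless spam (setq spam verdict)))
              (verdict                    ; t: ham
               (setq ham t)))))
    (or spam ham)))

;; Usage with a toy classifier standing in for the real one:
(my-combine-part-verdicts
 '("hello from a friend" "buy cheap pills")
 (lambda (text)
   (cond ((string-match-p "pills" text) "mail.spam")
         ((string-match-p "friend" text) t))))
;; => "mail.spam"
```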

Referring to the decode-if-needed part, is the
gnus-article-decode-hook going to try decoding content even if it's
plain text or is there some detection done?  If not, spam.el and
spam-stat.el or gnus-art.el should do some heuristics.

Thanks
Ted




* Re: spam-stat and base64 encoded messages
  2003-06-09 20:06             ` Ted Zlatanov
@ 2003-06-11 19:42               ` Jesper Harder
  2003-08-02 21:17                 ` Alex Schroeder
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Harder @ 2003-06-11 19:42 UTC (permalink / raw)
  Cc: John Owens

Ted Zlatanov <tzz@lifelogs.com> writes:

> On Mon, 09 Jun 2003, harder@myrealbox.com wrote:
>> I think that to make this work correctly, you'll need to parse the
>> MIME structure of the message, and then apply the proper decoding to
>> the appropriate parts.
>
> Hmm, are you sure we need to do full MIME parsing?  That would slow
> down the incoming mail splitting a lot, I would think.  But I don't
> know all the Gnus MIME parsing functionality, or how fast it is.

text/plain is handled specially [see below] (for performance reasons,
I guess) -- so I don't think there's a lot of extra overhead for that
case.

There's some overhead for other MIME types.  But OTOH handling them
better could also make it faster, i.e. there's no reason for
spam-stat.el to waste time analyzing an attached image (which I think
it's doing now).

> John Owens (cc-ed on this) was asking about forwarded spam messages,
> which are inside an envelope from SpamAssassin.  That's another case
> where spam-split or spam-stat-split has to do a lot of parsing.  Maybe
> there's a better way?  

Try playing with `mm-dissect-buffer' which returns a tree of the MIME
structure.
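
A minimal sketch of what playing with `mm-dissect-buffer' might look
like, assuming the usual handle layout from mm-decode.el (a multipart
node is a list whose car is the type string, a leaf is a list whose
car is a buffer).  The `my-' functions are made up for illustration,
and charset decoding is left out:

```elisp
(require 'mm-decode)

(defun my-decoded-text-parts (handle)
  "Return a list of the decoded text/* bodies found in the tree HANDLE.
HANDLE is a value returned by `mm-dissect-buffer'.  Non-text leaves
such as images are skipped."
  (cond
   ((null handle) nil)
   ;; Multipart node: the car is the type string, the cdr the children.
   ((stringp (car handle))
    (apply #'append (mapcar #'my-decoded-text-parts (cdr handle))))
   ;; Leaf node: the car is the buffer holding the still-encoded body.
   ((bufferp (car handle))
    (when (equal (mm-handle-media-supertype handle) "text")
      ;; `mm-get-part' undoes the Content-Transfer-Encoding.
      (list (mm-get-part handle))))))

(defun my-message-text (raw)
  "Return the decoded text parts of the raw message string RAW."
  (with-temp-buffer
    (insert raw)
    ;; Pass t so a missing MIME-Version header or a bare text/plain
    ;; message still yields a handle.
    (let ((handles (mm-dissect-buffer t)))
      (unwind-protect
          (my-decoded-text-parts handles)
        (when handles (mm-destroy-parts handles))))))

;; Usage on a made-up two-part message:
(my-message-text
 (concat "MIME-Version: 1.0\n"
         "Content-Type: multipart/mixed; boundary=\"b1\"\n\n"
         "--b1\nContent-Type: text/plain\n"
         "Content-Transfer-Encoding: base64\n\n"
         (base64-encode-string "Buy cheap pills now") "\n"
         "--b1\nContent-Type: image/png\n"
         "Content-Transfer-Encoding: base64\n\nAAAA\n"
         "--b1--\n"))
```

This should give back only the decoded text part and leave the image
alone, which also avoids spending time on attachments.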

> Referring to the decode-if-needed part, is the
> gnus-article-decode-hook going to try decoding content even if it's
> plain text or is there some detection done?

`gnus-article-decode-hook' is for text/plain _only_ (and headers).  So
yes, it will decode plain text.  It's used because text/plain isn't
handled by the normal MIME machinery as mentioned above.




* Re: spam-stat and base64 encoded messages
  2003-06-11 19:42               ` Jesper Harder
@ 2003-08-02 21:17                 ` Alex Schroeder
  2003-08-04  7:36                   ` Adam Sjøgren
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Schroeder @ 2003-08-02 21:17 UTC (permalink / raw)


Personally I don't have many problems with this -- perhaps because of
a simple additional rule I am using:  All multipart mails without
plain text alternative are considered to be spam before spam-stat
gets to look at them.  Works for me.

Alex.
-- 
http://www.emacswiki.org/alex/
I was on holidays from 2003-07-01 to 2003-07-29
and have a lot of catching up to do.




* Re: spam-stat and base64 encoded messages
  2003-08-02 21:17                 ` Alex Schroeder
@ 2003-08-04  7:36                   ` Adam Sjøgren
  2003-08-08  0:02                     ` Alex Schroeder
  0 siblings, 1 reply; 13+ messages in thread
From: Adam Sjøgren @ 2003-08-04  7:36 UTC (permalink / raw)


On Sat, 02 Aug 2003 23:17:56 +0200, Alex wrote:

> Personally I don't have many problems with this -- perhaps because
> of a simple additional rule I am using: All multipart mails without
> plain text alternative are considered to be spam before spam-stat
> gets to look at them.  Works for me.

Sounds like a good rule. Care to share the implementation?


  Best regards,

-- 
 "Hey, maybe they could figure out a way to make me           Adam Sjøgren
  care about how many stop bits I'm using, that'd be     asjo@koldfront.dk
  so retro!"





* Re: spam-stat and base64 encoded messages
  2003-08-04  7:36                   ` Adam Sjøgren
@ 2003-08-08  0:02                     ` Alex Schroeder
  0 siblings, 0 replies; 13+ messages in thread
From: Alex Schroeder @ 2003-08-08  0:02 UTC (permalink / raw)


spamtrap@koldfront.dk (Adam Sjøgren) writes:

> Sounds like a good rule. Care to share the implementation?

Here is what I use:

(setq nnmail-split-fancy
      `(| ("Gnus-Warning" "This is a duplicate" "mail.spam.duplicates")
	  ;; computer challenged people I know sending me HTML mails
	  ("From" "blablabla" "mail.family")
	  ;; remaining HTML only mail is spam
	  ("Content-Type" "text/html" "mail.spam.filtered")
	  ;; weird character sets are spam, too
	  ("Subject" "=?ks_c_5601-1987" "mail.spam.filtered")
	  ;; spam filtering based on statistics
	  (: spam-stat-split-fancy)
	  ;; now use the BBDB to split -- note that all the groups
	  ;; this splits into must be used as "good" mails for
	  ;; spam-stat!
	  (: (lambda ()
	       (car (bbdb/gnus-split-method))))
	  ;; some of the packages I maintain
	  ("Subject" "\\b\\(color-theme\\|ansi-color\\|sql\\|xemacs\\|emacs\\|spam-stat\\)\\b" "mail.emacs")
	  ("Subject" "\\berc\\b" "mail.emacs.erc")
	  ;; the rest is probably for me
	  "mail.misc"))

-- 
http://www.emacswiki.org/alex/
I was on holidays from 2003-07-01 to 2003-07-29
and have a lot of catching up to do.




end of thread, other threads:[~2003-08-08  0:02 UTC | newest]

Thread overview: 13+ messages
2003-06-05  7:01 spam-stat and base64 encoded messages Oystein Viggen
2003-06-05 20:05 ` Ted Zlatanov
2003-06-06  2:02   ` Jesper Harder
2003-06-06  3:22     ` Ted Zlatanov
2003-06-06 15:30       ` Jesper Harder
2003-06-06 23:21         ` Oystein Viggen
2003-06-09  1:21           ` Jesper Harder
2003-06-09 20:06             ` Ted Zlatanov
2003-06-11 19:42               ` Jesper Harder
2003-08-02 21:17                 ` Alex Schroeder
2003-08-04  7:36                   ` Adam Sjøgren
2003-08-08  0:02                     ` Alex Schroeder
2003-06-06  1:59 ` Jesper Harder
