* spam-stat and base64 encoded messages
From: Oystein Viggen @ 2003-06-05  7:01 UTC

Hi

Lately, I've been getting a lot of spam where the message is hidden as a
base64 encoded text/plain or text/html part.  Since spam-stat works on
the raw message it doesn't see any spammy words, so it won't classify
the message as spam.

Would it be possible to add some hack to spam-stat for decoding these
text parts?

Øystein
-- 
This message was brought to you by the letter ß and the number e.
* Re: spam-stat and base64 encoded messages
From: Ted Zlatanov @ 2003-06-05 20:05 UTC
Cc: ding

On Thu, 05 Jun 2003, oysteivi@tihlde.org wrote:

> Lately, I've been getting a lot of spam where the message is hidden
> as a base64 encoded text/plain or text/html part.  Since spam-stat
> works on the raw message it doesn't see any spammy words, so it
> won't classify the message as spam.
>
> Would it be possible to add some hack to spam-stat for decoding
> these text parts?

Maybe the backend or Gnus should optionally decode it before spam-stat
ever sees the message in the splitting?  Right now it's not done for
performance.  I don't think spam-stat.el or spam.el should do what
logically is not their task.

Ted
* Re: spam-stat and base64 encoded messages
From: Jesper Harder @ 2003-06-06  2:02 UTC

Ted Zlatanov <tzz@lifelogs.com> writes:

> Maybe the backend or Gnus should optionally decode it before
> spam-stat ever sees the message in the splitting?  Right now it's
> not done for performance.  I don't think spam-stat.el or spam.el
> should do what logically is not their task.

At the moment spam-stat.el also reads directly from files, so decoding
by the back end wouldn't be enough.
* Re: spam-stat and base64 encoded messages
From: Ted Zlatanov @ 2003-06-06  3:22 UTC

On Fri, 06 Jun 2003, harder@myrealbox.com wrote:

> Ted Zlatanov <tzz@lifelogs.com> writes:
>
>> Maybe the backend or Gnus should optionally decode it before
>> spam-stat ever sees the message in the splitting?  Right now it's
>> not done for performance.  I don't think spam-stat.el or spam.el
>> should do what logically is not their task.
>
> At the moment spam-stat.el also reads directly from files, so
> decoding by the back end wouldn't be enough.

You're right.  Assuming we don't care about the attachments as
entities, but only want to inline them in the message as plain text,
what Gnus functionality can I use to do this?

Thanks
Ted
* Re: spam-stat and base64 encoded messages
From: Jesper Harder @ 2003-06-06 15:30 UTC

Ted Zlatanov <tzz@lifelogs.com> writes:

> On Fri, 06 Jun 2003, harder@myrealbox.com wrote:
>> Ted Zlatanov <tzz@lifelogs.com> writes:
>>
>>> Maybe the backend or Gnus should optionally decode it before
>>> spam-stat ever sees the message in the splitting?  Right now it's
>>> not done for performance.  I don't think spam-stat.el or spam.el
>>> should do what logically is not their task.
>>
>> At the moment spam-stat.el also reads directly from files, so
>> decoding by the back end wouldn't be enough.
>
> You're right.  Assuming we don't care about the attachments as
> entities, but only want to inline them in the message as plain text,
> what Gnus functionality can I use to do this?

  (run-hooks 'gnus-article-decode-hook)

does part of the job.  Specifically it:

* decodes rfc2047-encoded headers.
* decodes single-part text/plain QP and Base64 encoded messages.

It's probably better than nothing, and as far as I can tell there are
no unintended side effects ... but a lot of spam is multipart/* and/or
text/html.

I don't think there's any existing functionality that does exactly
what we want.  `gnus-display-mime' is the closest, but it does far too
much.

You can hack it a bit and wrap some `flet's and `let's around it to
make it sort of work, but it's not really the right way (at least
without some more work):

(require 'cl)

(defun my-decode (&optional ihandles)
  (interactive)
  (flet ((gnus-treat-article (&rest ignore)))
    (let ((gnus-summary-buffer (current-buffer))
          (mm-text-html-renderer 'mm-inline-text))
      (save-excursion
        (let* ((handles (or ihandles
                            (mm-dissect-buffer nil gnus-article-loose-mime)
                            (and gnus-article-emulate-mime
                                 (mm-uu-dissect))))
               buffer-read-only handle name type b e display)
          (when (and (not ihandles)
                     (not gnus-displaying-mime))
            ;; Top-level call; we clean up.
            (when gnus-article-mime-handles
              (mm-destroy-parts gnus-article-mime-handles)
              (setq gnus-article-mime-handle-alist nil));; A trick.
            (setq gnus-article-mime-handles handles))
          (if (and handles
                   (or (not (stringp (car handles)))
                       (cdr handles)))
              (progn
                (when (and (not ihandles)
                           (not gnus-displaying-mime))
                  ;; Clean up for mime parts.
                  (article-goto-body)
                  (delete-region (point) (point-max)))
                (let ((gnus-displaying-mime t))
                  (gnus-mime-display-part handles)))))))))
* Re: spam-stat and base64 encoded messages
From: Oystein Viggen @ 2003-06-06 23:21 UTC

* [Jesper Harder]

> Ted Zlatanov <tzz@lifelogs.com> writes:
>
>> You're right.  Assuming we don't care about the attachments as
>> entities, but only want to inline them in the message as plain text,
>> what Gnus functionality can I use to do this?

As for html, I think we might as well want to inline it in the message
as html code instead of rendering it to plain text.  The bayesian
filter might benefit from recognizing words like "href" as spammy.

(in short, I'd like some code that identifies and decodes any base64
parts but does nothing else to the buffer)

> I don't think there's any existing functionality that does exactly
> what we want.  `gnus-display-mime' is the closest, but it does far
> too much.

I did some let'ing around gnus-display-mime, but wasn't able to get it
to work reliably.  As you say, the function does far too much.

I also did some experimenting with using article-de-base64-unreadable,
which seems the closest to what I wanted.

Spam-stat-test-directory with a de-base64-hack added would recognize
1643 messages in my 1737 message spam folder instead of 1622 without
the patch.  This was with no retraining of spam-stat, so any text
previously hidden in base64 can be considered new to spam-stat.  Not
much of an improvement, but it's measurable.  (and still no false
positives in my trained ham folders)

After retraining with base64 decoding, the number of recognized spams
in the folder sank to 1636.  Don't really know why..

A small patch:

Index: spam-stat.el
===================================================================
RCS file: /usr/local/cvsroot/gnus/lisp/spam-stat.el,v
retrieving revision 6.12
diff -u -r6.12 spam-stat.el
--- spam-stat.el        1 May 2003 14:14:31 -0000       6.12
+++ spam-stat.el        6 Jun 2003 23:28:27 -0000
@@ -229,6 +229,9 @@
       (set-buffer (get-buffer-create spam-stat-buffer-name))
       (erase-buffer)
       (insert str)
+      (ignore-errors
+        (let ((gnus-original-article-buffer (current-buffer)))
+          (article-de-base64-unreadable)))
       (setq spam-stat-buffer (current-buffer)))))
 
 (defun spam-stat-store-gnus-article-buffer ()
@@ -509,6 +512,9 @@
           (setq count (1+ count))
           (message "Reading %s: %.2f%%" dir (/ count max))
           (insert-file-contents f)
+          (ignore-errors
+            (let ((gnus-original-article-buffer (current-buffer)))
+              (article-de-base64-unreadable)))
           (funcall func)
           (erase-buffer))))))
 
@@ -547,6 +553,9 @@
           (message "Reading %.2f%%, score %.2f%%"
                    (/ count max) (/ score count))
           (insert-file-contents f)
+          (ignore-errors
+            (let ((gnus-original-article-buffer (current-buffer)))
+              (article-de-base64-unreadable)))
           (when (> (spam-stat-score-buffer) 0.9)
             (setq score (1+ score)))
           (erase-buffer))))

Ignore-errors is used to avoid the process choking on malformed base64
and quitting, which would be quite irritating.  There's probably a
better way to fix this -- Comments are very welcome :)

> You can hack it a bit and wrap some `flet's and `let's around it to
> make it sort of work, but it's not really the right way (at least
> without some more work):
>
> (require 'cl)
>
> (defun my-decode (&optional ihandles)

Haven't looked at your-decode yet.  I'll check it out later when I have
time.  (hopefully, someone who knows gnus and lisp will beat me to it :)

Øystein
-- 
If it ain't broke, don't break it.
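The point about `ignore-errors' and malformed base64 can be tried without Gnus loaded.  The following is only a rough sketch under stated assumptions: `my-decode-base64-body' is a hypothetical helper, not part of spam-stat.el, and it uses the built-in `base64-decode-region' as a stand-in for `article-de-base64-unreadable'.  It only handles a single-part message whose own headers declare base64.

```elisp
;; Hypothetical helper; a built-ins-only stand-in for the call to
;; `article-de-base64-unreadable' in the patch above.
(defun my-decode-base64-body ()
  "Decode the body of the current buffer in place if it is declared base64.
Return non-nil if the body was decoded.  Malformed base64 leaves the
buffer as it was and returns nil instead of signalling an error."
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search t)
          ;; The body starts after the first empty line.
          (body-start (and (re-search-forward "^\r?\n" nil t) (point))))
      (when body-start
        (goto-char (point-min))
        ;; Only look for the header inside the header block.
        (when (re-search-forward
               "^Content-Transfer-Encoding:[ \t]*base64[ \t]*\r?$"
               body-start t)
          ;; `base64-decode-region' signals an error on invalid input;
          ;; swallow it so a batch run over a directory keeps going.
          (ignore-errors
            (base64-decode-region body-start (point-max))
            t))))))
```

Called in a buffer holding one raw message, it would take the place of the three-line `ignore-errors' form in each hunk.  It does nothing for multipart messages.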
* Re: spam-stat and base64 encoded messages
From: Jesper Harder @ 2003-06-09  1:21 UTC

Oystein Viggen <oysteivi@tihlde.org> writes:

> I also did some experimenting with using
> article-de-base64-unreadable, which seems the closest to what I
> wanted.

`article-de-base64-unreadable' is too low-level.  It works on
single-part messages, but not on multipart/* -- which seems to be the
way most of my spam is sent.

I think that to make this work correctly, you'll need to parse the
MIME structure of the message, and then apply the proper decoding to
the appropriate parts.
* Re: spam-stat and base64 encoded messages
From: Ted Zlatanov @ 2003-06-09 20:06 UTC
Cc: John Owens

On Mon, 09 Jun 2003, harder@myrealbox.com wrote:

> I think that to make this work correctly, you'll need to parse the
> MIME structure of the message, and then apply the proper decoding to
> the appropriate parts.

Hmm, are you sure we need to do full MIME parsing?  That would slow
down the incoming mail splitting a lot, I would think.  But I don't
know all the Gnus MIME parsing functionality, or how fast it is.  See
below for more questions.

John Owens (cc-ed on this) was asking about forwarded spam messages,
which are inside an envelope from SpamAssassin.  That's another case
where spam-split or spam-stat-split has to do a lot of parsing.  Maybe
there's a better way?

We can invoke spam-split or spam-stat-split on each part of the
messages, then if they return t we know it's ham; if they return a
string it's spam, and nil means the part was neither.

In other words, we don't care about the deep structure, for instance
one attachment inside another.  We just want to find MIME boundaries,
take the text up to the next MIME boundary (even if it includes other
MIME boundaries), decode if needed (no decoding should be done on
plain text!), and analyze the part.  Is that possible already?

Referring to the decode-if-needed part, is the
gnus-article-decode-hook going to try decoding content even if it's
plain text or is there some detection done?  If not, spam.el and
spam-stat.el or gnus-art.el should do some heuristics.

Thanks
Ted
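The per-part idea above (t means ham, a string means spam, nil means neither) could be sketched roughly as follows.  The thread does not say how the per-part verdicts should be combined, so this sketch assumes the first spam verdict wins.  `my-classify-parts' and its CLASSIFY argument are hypothetical names; CLASSIFY stands in for something like spam-stat-split applied to one decoded part.

```elisp
;; Hypothetical sketch: combine per-part verdicts into one result.
;; Assumption (not from the thread): the first spam verdict wins.
(defun my-classify-parts (parts classify)
  "Call CLASSIFY on each string in PARTS and combine the verdicts.
CLASSIFY returns a group name (string) for spam, t for ham, and nil
when it cannot tell.  Return the first spam group found, else t if
any part was ham, else nil."
  (let ((verdict nil))
    (catch 'spam
      (dolist (part parts verdict)
        (let ((result (funcall classify part)))
          (cond ((stringp result) (throw 'spam result))
                ((eq result t) (setq verdict t))))))))
```

Stopping at the first spam part also skips scoring the remaining parts, which may matter given the performance worry about splitting.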
* Re: spam-stat and base64 encoded messages
From: Jesper Harder @ 2003-06-11 19:42 UTC
Cc: John Owens

Ted Zlatanov <tzz@lifelogs.com> writes:

> On Mon, 09 Jun 2003, harder@myrealbox.com wrote:
>> I think that to make this work correctly, you'll need to parse the
>> MIME structure of the message, and then apply the proper decoding to
>> the appropriate parts.
>
> Hmm, are you sure we need to do full MIME parsing?  That would slow
> down the incoming mail splitting a lot, I would think.  But I don't
> know all the Gnus MIME parsing functionality, or how fast it is.

text/plain is handled specially [see below] (for performance reasons,
I guess) -- so I don't think there's a lot of extra overhead for that
case.

There's some overhead for other MIME types.  But OTOH handling them
better could also make it faster, i.e. there's no reason for
spam-stat.el to waste time analyzing an attached image (which I think
it's doing now).

> John Owens (cc-ed on this) was asking about forwarded spam messages,
> which are inside an envelope from SpamAssassin.  That's another case
> where spam-split or spam-stat-split has to do a lot of parsing.  Maybe
> there's a better way?

Try playing with `mm-dissect-buffer' which returns a tree of the MIME
structure.

> Referring to the decode-if-needed part, is the
> gnus-article-decode-hook going to try decoding content even if it's
> plain text or is there some detection done?

`gnus-article-decode-hook' is for text/plain _only_ (and headers).  So
yes, it will decode plain text.  It's used because text/plain isn't
handled by the normal MIME machinery as mentioned above.
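As a starting point for playing with `mm-dissect-buffer', here is a minimal sketch that walks the returned tree and collects the decoded content of text/* parts only, so attached images are skipped.  `my-collect-text-parts' is a hypothetical name.  The sketch assumes the usual handle layout in mm-decode.el (a multipart node is a list whose car is the type string, a leaf has a buffer as its car) and passes a non-nil first argument to `mm-dissect-buffer' so that text/plain parts are dissected too.  `mm-get-part' undoes the transfer encoding but does no charset decoding.

```elisp
(require 'mm-decode)

;; Hypothetical helper, not part of Gnus or spam-stat.el.
(defun my-collect-text-parts (handle)
  "Return a list of decoded strings, one per text/* leaf under HANDLE.
HANDLE is a value returned by `mm-dissect-buffer'."
  (cond
   ;; Multipart node: the car is the type string, the cdr the children.
   ((stringp (car handle))
    (apply #'append (mapcar #'my-collect-text-parts (cdr handle))))
   ;; Leaf: keep it only if it is some kind of text.
   ((and (bufferp (car handle))
         (string-match-p "\\`text/" (mm-handle-media-type handle)))
    (list (mm-get-part handle)))
   ;; Anything else (images, nil, ...) contributes nothing.
   (t nil)))

;; Usage sketch on a small hand-written message.
(with-temp-buffer
  (insert "MIME-Version: 1.0\n"
          "Content-Type: multipart/mixed; boundary=\"sep\"\n\n"
          "--sep\n"
          "Content-Type: text/plain\n"
          "Content-Transfer-Encoding: base64\n\n"
          "YnV5IG5vdw==\n"
          "--sep\n"
          "Content-Type: image/png\n"
          "Content-Transfer-Encoding: base64\n\n"
          "AAAA\n"
          "--sep--\n")
  (let ((handles (mm-dissect-buffer t)))
    (unwind-protect
        (my-collect-text-parts handles)
      ;; The handles own temporary buffers; release them.
      (mm-destroy-parts handles))))
```

The collected strings could then be inserted into the buffer that spam-stat scores, in place of the raw message body.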
* Re: spam-stat and base64 encoded messages
From: Alex Schroeder @ 2003-08-02 21:17 UTC

Personally I don't have many problems with this -- perhaps because of
a simple additional rule I am using: All multipart mails without plain
text alternative are considered to be spam before spam-stat gets to
look at them.  Works for me.

Alex.
-- 
http://www.emacswiki.org/alex/
I was on holidays from 2003-07-01 to 2003-07-29
and have a lot of catching up to do.
* Re: spam-stat and base64 encoded messages
From: Adam Sjøgren @ 2003-08-04  7:36 UTC

On Sat, 02 Aug 2003 23:17:56 +0200, Alex wrote:

> Personally I don't have many problems with this -- perhaps because
> of a simple additional rule I am using: All multipart mails without
> plain text alternative are considered to be spam before spam-stat
> gets to look at them.  Works for me.

Sounds like a good rule.  Care to share the implementation?

  Best regards,

-- 
Adam Sjøgren
asjo@koldfront.dk

"Hey, maybe they could figure out a way to make me care about how many
stop bits I'm using, that'd be so retro!"
* Re: spam-stat and base64 encoded messages
From: Alex Schroeder @ 2003-08-08  0:02 UTC

spamtrap@koldfront.dk (Adam Sjøgren) writes:

> Sounds like a good rule.  Care to share the implementation?

Here is what I use:

(setq nnmail-split-fancy
      `(| ("Gnus-Warning" "This is a duplicate" "mail.spam.duplicates")
          ;; computer challenged people I know sending me HTML mails
          ("From" "blablabla" "mail.family")
          ;; remaining HTML only mail is spam
          ("Content-Type" "text/html" "mail.spam.filtered")
          ;; weird character sets are spam, too
          ("Subject" "=?ks_c_5601-1987" "mail.spam.filtered")
          ;; spam filtering based on statistics
          (: spam-stat-split-fancy)
          ;; now use the BBDB to split -- note that all the groups
          ;; this splits into must be used as "good" mails for
          ;; spam-stat!
          (: (lambda () (car (bbdb/gnus-split-method))))
          ;; some of the packages I maintain
          ("Subject" "\\b\\(color-theme\\|ansi-color\\|sql\\|xemacs\\|emacs\\|spam-stat\\)\\b" "mail.emacs")
          ("Subject" "\\berc\\b" "mail.emacs.erc")
          ;; the rest is probably for me
          "mail.misc"))

-- 
http://www.emacswiki.org/alex/
I was on holidays from 2003-07-01 to 2003-07-29
and have a lot of catching up to do.
* Re: spam-stat and base64 encoded messages
From: Jesper Harder @ 2003-06-06  1:59 UTC

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Oystein Viggen <oysteivi@tihlde.org> writes:

> Lately, I've been getting a lot of spam where the message is hidden as a
> base64 encoded text/plain or text/html part.  Since spam-stat works on
> the raw message it doesn't see any spammy words, so it won't classify
> the message as spam.
>
> Would it be possible to add some hack to spam-stat for decoding these
> text parts?

Look at the recent subthread starting with
<news:m3y914urwq.fsf@defun.localdomain> in dk.edb.system.unix.

The solution with inserting

  (run-hooks 'gnus-article-decode-hook)

doesn't have any bad side effects, AFAIK, but only deals with
text/plain.

The other solution is too hackish and probably needs some more work to
work properly in all circumstances.

> Øystein
> --
> This message was brought to you by the letter ß and the number e.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE+3/WbzEN/MEcathkRAgpxAJ9UmTnl67G+zeiRLWmIosJu3mfTwACeLAj7
exdYQqbfhh2IX3cekgC3U/o=
=PV6e
-----END PGP SIGNATURE-----
Thread overview: 13+ messages, newest 2003-08-08  0:02 UTC

2003-06-05  7:01 spam-stat and base64 encoded messages Oystein Viggen
2003-06-05 20:05 ` Ted Zlatanov
2003-06-06  2:02   ` Jesper Harder
2003-06-06  3:22     ` Ted Zlatanov
2003-06-06 15:30       ` Jesper Harder
2003-06-06 23:21         ` Oystein Viggen
2003-06-09  1:21           ` Jesper Harder
2003-06-09 20:06             ` Ted Zlatanov
2003-06-11 19:42               ` Jesper Harder
2003-08-02 21:17                 ` Alex Schroeder
2003-08-04  7:36                   ` Adam Sjøgren
2003-08-08  0:02                     ` Alex Schroeder
2003-06-06  1:59 ` Jesper Harder