nnml splitting on encoded headers

Gnus development mailing list

* nnml splitting on encoded headers
@ 2002-05-24 20:10 Mark Thomas
  2002-05-25 12:35 ` Mark Thomas
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mark Thomas @ 2002-05-24 20:10 UTC



I'd like to toss away mail that I know I'm not going to be able to
read.

I have split rules:
  ("subject"  "=\\?euc-kr\\?"          "mail.spam.asian")
  ("subject"  "=\\?ks_c_5601-1987\\?"  "mail.spam.asian")
but these don't work because Gnus has already decoded the messages.

There appears to be no other header to match on; the Content-Type is
multipart/alternative.

Is there any way I can split on the non-decoded Subject header (the
Subject header I see when I C-u g the message)?

Thanks,

-Mark





* Re: nnml splitting on encoded headers
  2002-05-24 20:10 nnml splitting on encoded headers Mark Thomas
@ 2002-05-25 12:35 ` Mark Thomas
  2002-05-25 17:25 ` Kai Großjohann
  2002-05-28 20:45 ` Norman Walsh
  2 siblings, 0 replies; 15+ messages in thread
From: Mark Thomas @ 2002-05-25 12:35 UTC


To answer my own question:

> Is there any way I can split on the non-decoded Subject header (the
> Subject header I see when I C-u g the message)?

(add-hook 'nnmail-split-hook 'rfc2047-encode-message-header)

-Mark




* Re: nnml splitting on encoded headers
  2002-05-24 20:10 nnml splitting on encoded headers Mark Thomas
  2002-05-25 12:35 ` Mark Thomas
@ 2002-05-25 17:25 ` Kai Großjohann
  2002-05-26  0:00   ` Russ Allbery
  2002-05-28 20:45 ` Norman Walsh
  2 siblings, 1 reply; 15+ messages in thread
From: Kai Großjohann @ 2002-05-25 17:25 UTC
  Cc: ding

Mark Thomas <swoon@bellatlantic.net> writes:

> I have split rules:
>   ("subject"  "=\\?euc-kr\\?"          "mail.spam.asian")
>   ("subject"  "=\\?ks_c_5601-1987\\?"  "mail.spam.asian")
> but these don't work because Gnus has already decoded the messages.

Is this really true?  Maybe fancy splitting matches on word
boundaries only, but "=" is not part of a word, so it can never be
the start of a word.  Maybe it works to add .* at the front and rear
of these regular expressions.

kai
-- 
Silence is foo!




* Re: nnml splitting on encoded headers
  2002-05-25 17:25 ` Kai Großjohann
@ 2002-05-26  0:00   ` Russ Allbery
  2002-05-26 12:32     ` Mark Thomas
  0 siblings, 1 reply; 15+ messages in thread
From: Russ Allbery @ 2002-05-26  0:00 UTC

Kai =?iso-8859-15?q?Gro=DFjohann?= <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:
> Mark Thomas <swoon@bellatlantic.net> writes:

>> I have split rules:
>>   ("subject"  "=\\?euc-kr\\?"          "mail.spam.asian")
>>   ("subject"  "=\\?ks_c_5601-1987\\?"  "mail.spam.asian")
>> but these don't work because Gnus has already decoded the messages.

> Is this really true?  Maybe fancy splitting matches on word
> boundaries only, but "=" is not part of a word, so it can never be
> the start of a word.  Maybe it works to add .* at the front and rear
> of these regular expressions.

Kai's right.  It does work; I use this all the time.  It isn't Gnus
decoding that stops this from working but the fact that Gnus by default
expects word boundaries on either side of the pattern, so you have to
add .* before and after.
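
A minimal sketch of what that looks like with fancy splitting, assuming
the rules live in nnmail-split-fancy.  The "mail.misc" fallback group is
a placeholder for illustration, not something taken from this thread:

```elisp
;; Sketch only: assumes fancy splitting is the active split method.
(setq nnmail-split-methods 'nnmail-split-fancy)
(setq nnmail-split-fancy
      '(| ;; A leading and trailing ".*" turns off the implicit
          ;; word-boundary match around the pattern.
          ("subject" ".*=\\?euc-kr\\?.*" "mail.spam.asian")
          ("subject" ".*=\\?ks_c_5601-1987\\?.*" "mail.spam.asian")
          ;; Hypothetical catch-all group for everything else.
          "mail.misc"))
```

These patterns can only match if the Subject header still contains the
raw encoded words at the time the split rules run.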

Incidentally, though, Gnus *does* appear to decode first and then apply
score files.  Is there any way around this so that one can use the same
patterns in score files for groups that are gatewayed mailing lists?

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>


* Re: nnml splitting on encoded headers
  2002-05-26  0:00   ` Russ Allbery
@ 2002-05-26 12:32     ` Mark Thomas
  2002-05-30 22:21       ` Russ Allbery
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Thomas @ 2002-05-26 12:32 UTC



Russ Allbery <rra@stanford.edu> writes:
>> Mark Thomas <swoon@bellatlantic.net> writes:
> 
>>> I have split rules:
>>>   ("subject"  "=\\?euc-kr\\?"          "mail.spam.asian")
>>>   ("subject"  "=\\?ks_c_5601-1987\\?"  "mail.spam.asian")
>>> but these don't work because Gnus has already decoded the
>>> messages.

> Kai's right.  It does work; I use this all the time.  It isn't Gnus
> decoding that stops this from working but the fact that Gnus by
> default expects word boundaries on either side of the pattern, so
> you have to add .* before and after.

Yes, these will never match because of the word boundary rule.  (I
don't normally use fancy splitting because I don't like that
restriction.  Unfortunately, gnus-summary-respool-trace doesn't tell
you which regexp matched if you don't use fancy splitting.  In trying
to figure out what was happening, I quickly translated my normal split
rules to fancy rules and I had missed this.)

However, Gnus decodes the headers before running the split rules.
Check out this ChangeLog entry:

    2002-01-26  Lars Magne Ingebrigtsen  <larsi@gnus.org>
    
    	* nnmail.el (nnmail-article-group): Decode headers before running
    	split rules over them.
    	(nnmail-mail-splitting-charset): New variable.

and this snippet of code from nnmail-article-group:

    	;; Decode MIME headers and charsets.
    	(let ((mail-parse-charset nnmail-mail-splitting-charset))
    	  (mail-decode-encoded-word-region (point-min) (point-max)))

Cheers,

-Mark




* Re: nnml splitting on encoded headers
  2002-05-24 20:10 nnml splitting on encoded headers Mark Thomas
  2002-05-25 12:35 ` Mark Thomas
  2002-05-25 17:25 ` Kai Großjohann
@ 2002-05-28 20:45 ` Norman Walsh
  2002-05-28 22:17   ` Mark Thomas
  2002-05-29  7:39   ` Kai Großjohann
  2 siblings, 2 replies; 15+ messages in thread
From: Norman Walsh @ 2002-05-28 20:45 UTC
  Cc: ding

/ Mark Thomas <swoon@bellatlantic.net> was heard to say:
| I'd like to toss away mail that I know I'm not going to be able to
| read.
|
| I have split rules:
|   ("subject"  "=\\?euc-kr\\?"          "mail.spam.asian")
|   ("subject"  "=\\?ks_c_5601-1987\\?"  "mail.spam.asian")
| but these don't work because Gnus has already decoded the messages.

/ Mark Thomas <swoon@bellatlantic.net> was heard to say:
| To answer my own question:
|
| > Is there any way I can split on the non-decoded Subject header (the
| > Subject header I see when I C-u g the message)?
|
| (add-hook 'nnmail-split-hook 'rfc2047-encode-message-header)

These messages hint at a feature I would dearly love to have: the
ability to avoid mail in character sets I can't read. But I'm confused
by one part.

Looking at some representative Asian spam on my machine, C-u g doesn't
display the encoding in the subject; instead I see things like this:

  Message-Id: <200205281859.OAA05629@nexus.berkshire.net>
  Reply-To: no@kojein.com
  From: ¿¹½º¸Ç<yes@kojein.com>
  To: ndw@nwalsh.com
  Subject: (±¸ÀÎ,±¤°í)ÀçÅÃ¾Ë¹Ù ÇÏ½ÇºÐ 
  Mime-Version: 1.0
  Content-Type: text/html; charset="ks_c_5601-1987"

Do the split rules shown above really match on the charset described
in Content-Type, or is there some other switch I have to enable to
make that appear literally in the subject?

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | Youth lasts much longer than young
http://nwalsh.com/            | people think.--Comtesse Diane


* Re: nnml splitting on encoded headers
  2002-05-28 20:45 ` Norman Walsh
@ 2002-05-28 22:17   ` Mark Thomas
  2002-05-29  0:31     ` Russ Allbery
  2002-05-29  7:39   ` Kai Großjohann
  1 sibling, 1 reply; 15+ messages in thread
From: Mark Thomas @ 2002-05-28 22:17 UTC



On Tue, 28 May 2002, ndw@nwalsh.com wrote:

> Looking at some representative Asian spam on my machine, C-u g
> doesn't display the encoding in the subject; instead I see things
> like this:
> 
>   Message-Id: <200205281859.OAA05629@nexus.berkshire.net>
>   Reply-To: no@kojein.com
>   From: ¿¹½º¸Ç<yes@kojein.com>
>   To: ndw@nwalsh.com
>   Subject: (±¸ÀÎ,±¤°í)ÀçÅÃ¾Ë¹Ù ÇÏ½ÇºÐ 
>   Mime-Version: 1.0
>   Content-Type: text/html; charset="ks_c_5601-1987"

For this particular message, I would use rules:
    ("mail.spam.asian"     "^content-type:.*\\beuc-kr\\b")
    ("mail.spam.asian"     "^content-type:.*\\bks_c_5601-1987\\b")
Edit as necessary to make those fancy-rules.  Luckily there was a
charset in the headers for those split rules to match.

Sometimes I get spam where the Content-Type is multipart/alternative
and there is no charset listed in the headers.  For these, I use the
following rule to catch un-encoded spam:
    ("mail.spam.asian"     "^subject:.*[¡-ÿ]\\{4,\\}")
I figure any mail with four or more high-bit characters in a row in
the subject is probably not one I'm going to be able to read.

I've tried to use the rules
    ("mail.spam.asian"     "^subject:.*=\\?euc-kr\\?")
    ("mail.spam.asian"     "^subject:.*=\\?ks_c_5601-1987\\?")
to catch properly encoded headers, but Gnus decodes the message's
headers before it looks at the split rules (at least for back ends that
use nnmail-article-group) so these rules will never match.

Re-encoding the headers with
   (add-hook 'nnmail-split-hook 'rfc2047-encode-message-header)
lets those rules work.  However, this hook also encodes the previously
unencoded headers, so my match on high-bit-characters no longer works.
Sigh.

The unencoded Subject headers I receive far outnumber the
encoded ones, so I removed the function from the nnmail-split-hook.
This will work until I get too many properly encoded spams, in which
case I'll just yank the decoding call out of nnmail-article-group.
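
Put together, the rules for unencoded spam would sit in
nnmail-split-methods roughly like this.  This is a sketch, not my exact
setup: the final "mail.misc" catch-all is a placeholder, and the
character range assumes the unencoded bytes show up as Latin-1
characters, as in the headers quoted above.

```elisp
;; Sketch only: ordinary (non-fancy) split rules, first match wins.
(setq nnmail-split-methods
      '(;; A charset named in the Content-Type header.
        ("mail.spam.asian" "^content-type:.*\\beuc-kr\\b")
        ("mail.spam.asian" "^content-type:.*\\bks_c_5601-1987\\b")
        ;; Four or more high-bit characters in a row in the Subject.
        ("mail.spam.asian" "^subject:.*[¡-ÿ]\\{4,\\}")
        ;; Hypothetical catch-all; the empty regexp matches everything.
        ("mail.misc" "")))
```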

-Mark




* Re: nnml splitting on encoded headers
  2002-05-28 22:17   ` Mark Thomas
@ 2002-05-29  0:31     ` Russ Allbery
  0 siblings, 0 replies; 15+ messages in thread
From: Russ Allbery @ 2002-05-29  0:31 UTC


Mark Thomas <swoon@bellatlantic.net> writes:

> Sometimes I get spam where the Content-Type is multipart/alternative
> and there is no charset listed in the headers.  For these, I use the
> following rule to catch un-encoded spam:
>     ("mail.spam.asian"     "^subject:.*[¡-ÿ]\\{4,\\}")
> I figure any mail with four or more high-bit characters in a row in
> the subject is probably not one I'm going to be able to read.

I've had extremely good luck with the following regex:

    .*[¹²³°¶÷¾].*

It still passes pretty much anything that's ISO 8859-1 or -15, and it
catches unencoded Korean and Cyrillic pretty reliably.  Adjust to taste if
you get unencoded subject headers in character sets other than ISO 8859-1,
of course.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>




* Re: nnml splitting on encoded headers
  2002-05-28 20:45 ` Norman Walsh
  2002-05-28 22:17   ` Mark Thomas
@ 2002-05-29  7:39   ` Kai Großjohann
  1 sibling, 0 replies; 15+ messages in thread
From: Kai Großjohann @ 2002-05-29  7:39 UTC
  Cc: Mark Thomas, ding

Norman Walsh <ndw@nwalsh.com> writes:

> Do the split rules shown above really match on the charset described
> in Content-Type, or is there some other switch I have to enable to
> make that appear literally in the subject?

Neither.  Some messages use encodings similar to my From header
in their subject.  The split rules were intended for such messages.

Your spam is 8bit...

kai
-- 
Silence is foo!




* Re: nnml splitting on encoded headers
  2002-05-26 12:32     ` Mark Thomas
@ 2002-05-30 22:21       ` Russ Allbery
  2002-06-03  3:34         ` Jesper Harder
  2002-06-03 17:52         ` Simon Josefsson
  0 siblings, 2 replies; 15+ messages in thread
From: Russ Allbery @ 2002-05-30 22:21 UTC

Mark Thomas <swoon@bellatlantic.net> writes:

> However, Gnus decodes the headers before running the split rules.  Check
> out this ChangeLog entry:

>     2002-01-26  Lars Magne Ingebrigtsen  <larsi@gnus.org>

>     	* nnmail.el (nnmail-article-group): Decode headers before running
>     	split rules over them.
>     	(nnmail-mail-splitting-charset): New variable.

> and this snippet of code from nnmail-article-group:

>     	;; Decode MIME headers and charsets.
>     	(let ((mail-parse-charset nnmail-mail-splitting-charset))
>     	  (mail-decode-encoded-word-region (point-min) (point-max)))

Um, that's an extremely serious bug for me.  That means I can't upgrade to
any newer version of Gnus unless there's some way to turn this off, as far
and away the most successful spam filtering rules that I have are those
that catch irregularities of the original, encoded or untagged 8-bit
Subject line.

Telling Gnus to re-encode before split rules apply won't cut it, I
believe, unless that re-encoding leaves raw 8-bit that was originally in
the Subject header alone.

So... how do I turn this feature off?  I can understand how this would be
useful for people who can read other character sets, so I don't want to
see it removed entirely, but it's a serious problem for me.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>


* Re: nnml splitting on encoded headers
  2002-05-30 22:21       ` Russ Allbery
@ 2002-06-03  3:34         ` Jesper Harder
  2002-06-03 17:52         ` Simon Josefsson
  1 sibling, 0 replies; 15+ messages in thread
From: Jesper Harder @ 2002-06-03  3:34 UTC


rra@stanford.edu (Russ Allbery) writes:

> Mark Thomas <swoon@bellatlantic.net> writes:
>
>> However, Gnus decodes the headers before running the split rules.  Check
>> out this ChangeLog entry:
>
> So... how do I turn this feature off?  I can understand how this would be
> useful for people who can read other character sets, so I don't want to
> see it removed entirely, but it's a serious problem for me.

You can turn it off with this:

(require 'cl)

(defadvice nnmail-article-group (around rra-dont-decode)
  "Don't decode headers before splitting."
  (flet ((mail-decode-encoded-word-region (start end) nil))
    ad-do-it))

(ad-activate 'nnmail-article-group)


But we should probably have a more obvious way of turning it off.  I
agree that splitting on unencoded headers is useful -- I do it myself in
.procmail.





* Re: nnml splitting on encoded headers
  2002-05-30 22:21       ` Russ Allbery
  2002-06-03  3:34         ` Jesper Harder
@ 2002-06-03 17:52         ` Simon Josefsson
  2002-06-03 19:41           ` Kai Großjohann
  1 sibling, 1 reply; 15+ messages in thread
From: Simon Josefsson @ 2002-06-03 17:52 UTC
  Cc: ding

Russ Allbery <rra@stanford.edu> writes:

>> and this snippet of code from nnmail-article-group:
>
>>     	;; Decode MIME headers and charsets.
>>     	(let ((mail-parse-charset nnmail-mail-splitting-charset))
>>     	  (mail-decode-encoded-word-region (point-min) (point-max)))
>
> Um, that's an extremely serious bug for me.  That means I can't upgrade to
> any newer version of Gnus unless there's some way to turn this off, as far
> and away the most successful spam filtering rules that I have are those
> that catch irregularities of the original, encoded or untagged 8-bit
> Subject line.
>
> Telling Gnus to re-encode before split rules apply won't cut it, I
> believe, unless that re-encoding leaves raw 8-bit that was originally in
> the Subject header alone.
>
> So... how do I turn this feature off?  I can understand how this would be
> useful for people who can read other character sets, so I don't want to
> see it removed entirely, but it's a serious problem for me.

This patch should make the behaviour customizable.  Does it work?  One
could argue about what the default should be, but Lars made it the
default so I won't change it.

Hm. Perhaps the default really should be off since comparing non-ascii
strings in emacs does not work by default.  The same character in
Latin-1, Latin-9 or Unicode is not regarded as the same by Emacs, so
comparing decoded values doesn't work.  It is also more backwards
compatible.  Opinions?

--- nnmail.el.~6.41.~	Thu May 16 16:51:44 2002
+++ nnmail.el	Mon Jun  3 19:42:26 2002
@@ -484,6 +484,11 @@
   :group 'nnmail
   :type 'symbol)
 
+(defcustom nnmail-mail-splitting-decodes t
+  "Whether the nnmail splitting functionality should MIME decode headers."
+  :group 'nnmail
+  :type 'boolean)
+
 ;;; Internal variables.
 
 (defvar nnmail-article-buffer " *nnmail incoming*"
@@ -1000,8 +1005,9 @@
 	;; Copy the headers into the work buffer.
 	(insert-buffer-substring obuf beg end)
 	;; Decode MIME headers and charsets.
+	(when nnmail-mail-splitting-decodes
 	(let ((mail-parse-charset nnmail-mail-splitting-charset))
-	  (mail-decode-encoded-word-region (point-min) (point-max)))
+	    (mail-decode-encoded-word-region (point-min) (point-max))))
 	;; Fold continuation lines.
 	(goto-char (point-min))
 	(while (re-search-forward "\\(\r?\n[ \t]+\\)+" nil t)
--- gnus.texi.~6.281.~	Thu May 23 21:22:49 2002
+++ gnus.texi	Mon Jun  3 19:50:36 2002
@@ -12439,6 +12439,15 @@
 @code{nnmail-split-header-length-limit} are excluded from the split
 function.
 
+@vindex nnmail-mail-splitting-charset
+@vindex nnmail-mail-splitting-decodes
+By default the splitting codes MIME decodes headers so you can match
+on non-ASCII strings.  The @code{nnmail-mail-splitting-charset}
+variable specifies the default charset for decoding.  The behaviour
+can be turned off completely by binding
+@code{nnmail-mail-splitting-decodes} to nil, which is useful if you
+want to match articles based on the raw header data.
+
 Gnus gives you all the opportunity you could possibly want for shooting
 yourself in the foot.  Let's say you create a group that will contain
 all the mail you get from your boss.  And then you accidentally





* Re: nnml splitting on encoded headers
  2002-06-03 17:52         ` Simon Josefsson
@ 2002-06-03 19:41           ` Kai Großjohann
  2002-06-03 19:48             ` Simon Josefsson
  0 siblings, 1 reply; 15+ messages in thread
From: Kai Großjohann @ 2002-06-03 19:41 UTC
  Cc: ding

Simon Josefsson <jas@extundo.com> writes:

> Hm. Perhaps the default really should be off since comparing non-ascii
> strings in emacs does not work by default.  The same character in
> Latin-1, Latin-9 or Unicode is not regarded as the same by Emacs, so
> comparing decoded values doesn't work.  It is also more backwards
> compatible.  Opinions?

Seems to be good.  If people object, the default value can be frobbed
and frobbed...

kai
-- 
Silence is foo!




* Re: nnml splitting on encoded headers
  2002-06-03 19:41           ` Kai Großjohann
@ 2002-06-03 19:48             ` Simon Josefsson
  2002-06-03 20:04               ` Russ Allbery
  0 siblings, 1 reply; 15+ messages in thread
From: Simon Josefsson @ 2002-06-03 19:48 UTC
  Cc: ding

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Simon Josefsson <jas@extundo.com> writes:
>
>> Hm. Perhaps the default really should be off since comparing non-ascii
>> strings in emacs does not work by default.  The same character in
>> Latin-1, Latin-9 or Unicode is not regarded as the same by Emacs, so
>> comparing decoded values doesn't work.  It is also more backwards
>> compatible.  Opinions?
>
> Seems to be good.  If people object, the default value can be frobbed
> and frobbed...

I committed the patch and changed the default.  If the patch didn't
work, or if the default should be changed, holler (or change it).
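
For anyone who wants one behaviour or the other regardless of the
default, here is a sketch of setting the variable from the patch above.
It assumes a Gnus that includes the patch; where the form goes (for
example, your Gnus init file) is up to you.

```elisp
;; Match split rules against the raw header data (no MIME decoding).
(setq nnmail-mail-splitting-decodes nil)

;; Or: decode headers first, so rules can match non-ASCII strings.
;; nnmail-mail-splitting-charset then gives the default charset
;; used for decoding.
;; (setq nnmail-mail-splitting-decodes t)
```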





* Re: nnml splitting on encoded headers
  2002-06-03 19:48             ` Simon Josefsson
@ 2002-06-03 20:04               ` Russ Allbery
  0 siblings, 0 replies; 15+ messages in thread
From: Russ Allbery @ 2002-06-03 20:04 UTC


Simon Josefsson <jas@extundo.com> writes:
> Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:
>> Simon Josefsson <jas@extundo.com> writes:

>>> Hm. Perhaps the default really should be off since comparing non-ascii
>>> strings in emacs does not work by default.  The same character in
>>> Latin-1, Latin-9 or Unicode is not regarded as the same by Emacs, so
>>> comparing decoded values doesn't work.  It is also more backwards
>>> compatible.  Opinions?

>> Seems to be good.  If people object, the default value can be frobbed
>> and frobbed...

> I committed the patch and changed the default.  If the patch didn't
> work, or if the default should be changed, holler (or change it).

Thank you!

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>


