More liberal MIME decoding (unencoded question marks in encoded words)

Gnus development mailing list

* More liberal MIME decoding (unencoded question marks in encoded words)
@ 2007-11-24 13:29 Reiner Steib
  2007-11-26 12:31 ` Katsumi Yamaoka
  0 siblings, 1 reply; 6+ messages in thread
From: Reiner Steib @ 2007-11-24 13:29 UTC (permalink / raw)
  To: ding

Hi,

I see more and more incorrectly encoded subjects like this:

,----
| Subject: =?ISO-8859-1?Q?bequeme_Index-Eintr=E4ge_mit_TexnicCenter??=
| Organization: http://groups.google.com
| Message-ID: <78a8858d-463a-4a24-b18c-d0579ef60be9@s19g2000prg.googlegroups.com>
| User-Agent: G2/1.0
`----

The mistake on Google's side is not to encode the (trailing) question
mark.  How can we make Gnus' decoder more liberal?

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/





* Re: More liberal MIME decoding (unencoded question marks in encoded words)
  2007-11-24 13:29 More liberal MIME decoding (unencoded question marks in encoded words) Reiner Steib
@ 2007-11-26 12:31 ` Katsumi Yamaoka
  2007-11-26 22:08   ` Reiner Steib
  0 siblings, 1 reply; 6+ messages in thread
From: Katsumi Yamaoka @ 2007-11-26 12:31 UTC (permalink / raw)
  To: ding

[-- Attachment #1: Type: text/plain, Size: 863 bytes --]

>>>>> Reiner Steib wrote:

> I see more and more incorrectly encoded subjects like this:

> ,----
>| Subject: =?ISO-8859-1?Q?bequeme_Index-Eintr=E4ge_mit_TexnicCenter??=
>| Organization: http://groups.google.com
>| Message-ID: <78a8858d-463a-4a24-b18c-d0579ef60be9@s19g2000prg.googlegroups.com>
>| User-Agent: G2/1.0
> `----

> The mistake on Google's side is not to encode the (trailing) question
> mark.  How can we make Gnus' decoder more liberal?

Maybe the patch below does it but we must check it thoroughly.
Would we be able to make complete test cases?

(rfc2047-decode-string "=?ISO-8859-1?Q??foo?=")
"?foo"
(rfc2047-decode-string "=?ISO-8859-1?Q?=foo?=")
"=foo"
(rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
"foo?"
(rfc2047-decode-string "=?ISO-8859-1?Q?foo?=?=")
"foo?="
(rfc2047-decode-string "=?ISO-8859-1?Q?foo?==?ISO-8859-1?Q?bar?=")
"foobar"
...

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 912 bytes --]

--- rfc2047.el~	2007-10-04 21:53:36 +0000
+++ rfc2047.el	2007-11-26 12:25:07 +0000
@@ -827,8 +827,10 @@
 
 (eval-and-compile
   (defconst rfc2047-encoded-word-regexp
-    "=\\?\\([^][\000-\040()<>@,\;:*\\\"/?.=]+\\)\\(?:\\*[^?]+\\)?\
-\\?\\(B\\|Q\\)\\?\\([!->@-~ ]*\\)\\?="))
+    "=\\?\\([^][\000-\040()<>@,\;:*\\\"/?.=]+\\)\\(?:\\*[^?]+\\)?\\?\
+\\(B\\?[+/0-9A-Za-z]*=*\
+\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
+\\)\\?="))
 
 (defvar rfc2047-quote-decoded-words-containing-tspecials nil
   "If non-nil, quote decoded words containing special characters.")
@@ -967,7 +969,7 @@
 	  (while match
 	    (push (list (match-string 2) ;; charset
 			(char-after (match-beginning 3)) ;; encoding
-			(match-string 4) ;; encoded-text
+			(substring (match-string 3) 2) ;; encoded-text
 			(match-string 1)) ;; encoded-word
 		  words)
 	    ;; Look for the subsequent encoded-words.


* Re: More liberal MIME decoding (unencoded question marks in encoded words)
  2007-11-26 12:31 ` Katsumi Yamaoka
@ 2007-11-26 22:08   ` Reiner Steib
  2007-11-27  9:34     ` Katsumi Yamaoka
  0 siblings, 1 reply; 6+ messages in thread
From: Reiner Steib @ 2007-11-26 22:08 UTC (permalink / raw)
  To: ding

On Mon, Nov 26 2007, Katsumi Yamaoka wrote:

>>>>>> Reiner Steib wrote:
>> The mistake on Google's side is not to encode the (trailing) question
>> mark.  How can we make Gnus' decoder more liberal?
>
> Maybe the patch below does it

Thanks.

> but we must check it thoroughly.

Agreed.

> Would we be able to make complete test cases?
>
> (rfc2047-decode-string "=?ISO-8859-1?Q??foo?=")
> "?foo"
> (rfc2047-decode-string "=?ISO-8859-1?Q?=foo?=")
> "=foo"
> (rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
> "foo?"
> (rfc2047-decode-string "=?ISO-8859-1?Q?foo?=?=")
> "foo?="
> (rfc2047-decode-string "=?ISO-8859-1?Q?foo?==?ISO-8859-1?Q?bar?=")
> "foobar"

Do you see the other examples often in the wild?  If not, I'd rather
not make the decoder too liberal.  And we probably should have an
option to toggle strict/loose decoding.

BTW, another problem is that we "double encode"
(`rfc2047-encode-encoded-words') such subjects:

ELISP> (rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
"=?ISO-8859-1?Q?foo??="
ELISP> (rfc2047-encode-string "=?ISO-8859-1?Q?foo??=")
"=?us-ascii?Q?=3D=3FISO-8859-1=3FQ=3Ffoo=3F=3F=3D?="

AFAICS, Gnus (`rfc2047-encodable-p'?) simply looks for "=?".  When
found, the string is (double) encoded.  I'm not sure if this behavior
is wrong, but RFC2047 says that an encoded word is...

,----[ RFC2047 ]
|    Generally, an "encoded-word" is a sequence of printable ASCII
|    characters that begins with "=?", ends with "?=", and has two "?"s in
|    between.  It specifies a character set and an encoding method, and
|    also includes the original text encoded as graphic ASCII characters,
|    according to the rules for that encoding method.
`----

..., i.e. shouldn't we use "=\\?.+\\?[qb]\\?.+\\?=" (or similar)
instead of "=?"?
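
The difference between the two checks can be sketched as follows.  This
is only an illustration: the function names are hypothetical and are not
part of rfc2047.el, and the regexp is the rough one suggested above, not
a tested replacement.

```elisp
;; Sketch only: compare the plain "=?" substring check with the stricter
;; regexp suggested above.  Both function names are hypothetical.

(defun my-looks-encoded-naive-p (string)
  "Return non-nil if STRING contains the two characters \"=?\" anywhere."
  (and (string-match-p (regexp-quote "=?") string) t))

(defun my-looks-encoded-strict-p (string)
  "Return non-nil if STRING contains something shaped like an encoded word."
  ;; Bind `case-fold-search' so that [qb] also matches "Q" and "B".
  (let ((case-fold-search t))
    (and (string-match-p "=\\?.+\\?[qb]\\?.+\\?=" string) t)))

(my-looks-encoded-naive-p "Is x =? y")               ; non-nil
(my-looks-encoded-strict-p "Is x =? y")              ; nil
(my-looks-encoded-strict-p "=?ISO-8859-1?Q?foo??=")  ; non-nil
```

With the stricter check, a subject that merely contains "=?" would no
longer be treated as if it held an encoded word.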

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/


* Re: More liberal MIME decoding (unencoded question marks in encoded words)
  2007-11-26 22:08   ` Reiner Steib
@ 2007-11-27  9:34     ` Katsumi Yamaoka
  2007-12-01 13:17       ` Reiner Steib
  0 siblings, 1 reply; 6+ messages in thread
From: Katsumi Yamaoka @ 2007-11-27  9:34 UTC (permalink / raw)
  To: ding

I've installed the new ones in the Gnus trunk.  Decoding bad Q
encoding is enabled by default.

>>>>> Reiner Steib wrote:
> On Mon, Nov 26 2007, Katsumi Yamaoka wrote:

>> Would we be able to make complete test cases?
>>
>> (rfc2047-decode-string "=?ISO-8859-1?Q??foo?=")
>> "?foo"

[...]

> Do you see the other examples often in the wild?

No, I've never seen such ones at all, though I always examine
raw data when decoding fails.  What I saw was mainly broken B
encoding (99.9% of Japanese MIME messages use B encoding).

> If not, I'd rather not make the decoder too liberal.

I thought it wasn't going too far, since it doesn't support encoded
words folded into two or more lines.  In fact, there's a reason I
didn't make it support newlines in encoded words: the regexp pattern
for Q encoding is ambiguous in a sense, so if it supported newlines,
it might cause re-search to get stuck on an encoded word that is not
terminated with "?=".

FYI:

> +\\(B\\?[+/0-9A-Za-z]*=*\

This pattern is restricted to only the characters that B
encoding uses, since the base64 decoder doesn't work with data
containing other characters.

> +\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
> +\\)\\?="))

This pattern is similar to:

"Q\\?\\(\\?+[^\n=?]\\)?\\([^\n?]+\\?+[^\n=?]\\)*[^\n?]*\\?*"
     <--------1-------><----------2,3----------><--4--><-5->

1. After "Q?", allow "?"s that are followed by a character other
   than "=".
2. Allow "=" after "Q?"; it isn't regarded as the terminator.
3. In the middle of an encoded word, allow "?"s that are followed
   by a character other than "=".
4. Allow any characters other than "?" in the middle of an
   encoded word.
5. At the end, allow "?"s.
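
To see what the five parts accept, here is a minimal sketch built on the
simplified pattern (not the exact regexp from the patch).  The names
`my-loose-q-regexp' and `my-loose-q-payload' are hypothetical, and the
charset part is reduced to "anything but ? or newline" for brevity.  It
only extracts the raw payload; it does not decode it.

```elisp
;; Hypothetical helper, not part of rfc2047.el.
(defconst my-loose-q-regexp
  (concat "=\\?\\([^?\n]+\\)\\?Q\\?"      ; "=?charset?Q?" (simplified)
          "\\("
          "\\(?:\\?+[^\n=?]\\)?"           ; part 1
          "\\(?:[^\n?]+\\?+[^\n=?]\\)*"    ; parts 2 and 3
          "[^\n?]*"                        ; part 4
          "\\?*"                           ; part 5
          "\\)"
          "\\?=")                          ; terminator
  "Simplified loose regexp for Q-encoded words; group 2 is the payload.")

(defun my-loose-q-payload (string)
  "Return the raw Q payload of the first loose encoded word in STRING.
Return nil if STRING contains no match."
  (let ((case-fold-search t))
    (when (string-match my-loose-q-regexp string)
      (match-string 2 string))))

(my-loose-q-payload "=?ISO-8859-1?Q??foo?=")   ; "?foo"
(my-loose-q-payload "=?ISO-8859-1?Q?=foo?=")   ; "=foo"
(my-loose-q-payload "=?ISO-8859-1?Q?foo??=")   ; "foo?"
(my-loose-q-payload "=?ISO-8859-1?Q?foo?=?=")  ; "foo"
```

In the last case the match stops at the first "?=", so the trailing "?="
stays outside the encoded word, which is consistent with the expected
decoding result "foo?=".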

> And we probably should have an option to toggle strict/loose
> decoding.

I've introduced the `rfc2047-allow-irregular-q-encoded-words'
option.  I'd like it to be tested widely, so I've set the default
value to t.  But it might have to be nil when it is imported into
the stable branch.  Now there are two regexps: one is
`rfc2047-encoded-word-regexp' for strict decoding, the other is
`rfc2047-encoded-word-regexp-loose'.

> BTW, another problem is that we "double encode"
> (`rfc2047-encode-encoded-words') such subjects:

> ELISP> (rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
> "=?ISO-8859-1?Q?foo??="
> ELISP> (rfc2047-encode-string "=?ISO-8859-1?Q?foo??=")
> "=?us-ascii?Q?=3D=3FISO-8859-1=3FQ=3Ffoo=3F=3F=3D?="

> AFAICS, Gnus (`rfc2047-encodable-p'?) simply looks for "=?".

[...]

> ..., i.e. shouldn't we use "=\\?.+\\?[qb]\\?.+\\?=" (or similar)
> instead of "=?"?

I agree with you.  I've made `rfc2047-encodable-p' use
`rfc2047-encoded-word-regexp' instead of "=?".  If this change
causes other trouble, though, it will be hard to track down.

Regards,


* Re: More liberal MIME decoding (unencoded question marks in encoded words)
  2007-11-27  9:34     ` Katsumi Yamaoka
@ 2007-12-01 13:17       ` Reiner Steib
  2007-12-04  9:19         ` Katsumi Yamaoka
  0 siblings, 1 reply; 6+ messages in thread
From: Reiner Steib @ 2007-12-01 13:17 UTC (permalink / raw)
  To: ding

On Tue, Nov 27 2007, Katsumi Yamaoka wrote:

>>>>>> Reiner Steib wrote:
>> On Mon, Nov 26 2007, Katsumi Yamaoka wrote:
>>> Would we be able to make complete test cases?

I don't think providing *complete* test cases is possible.  But
regression tests would be very nice to have.

How about defining a variable containing a list of (decoded . encoded)
pairs, decode/encode the strings and put the results in a buffer (or
file) and compare them?
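
A minimal sketch of the decoding half of that idea, assuming the pairs
are simply the examples from earlier in this thread.  The variable and
function names are hypothetical, and the result is returned as a list
instead of being written to a buffer.  Whether the pairs pass depends on
the rfc2047.el version and its loose-decoding settings.

```elisp
(require 'rfc2047)

;; Hypothetical names; the pairs are the examples quoted in this thread.
(defvar my-rfc2047-test-pairs
  '(("?foo"   . "=?ISO-8859-1?Q??foo?=")
    ("=foo"   . "=?ISO-8859-1?Q?=foo?=")
    ("foo?"   . "=?ISO-8859-1?Q?foo??=")
    ("foo?="  . "=?ISO-8859-1?Q?foo?=?=")
    ("foobar" . "=?ISO-8859-1?Q?foo?==?ISO-8859-1?Q?bar?="))
  "List of (DECODED . ENCODED) pairs used as decoding regression tests.")

(defun my-rfc2047-decoding-failures (&optional pairs)
  "Decode each ENCODED string in PAIRS and return the mismatches.
PAIRS defaults to `my-rfc2047-test-pairs'.  Each element of the
result is a list (ENCODED EXPECTED ACTUAL); nil means all passed."
  (let (failures)
    (dolist (pair (or pairs my-rfc2047-test-pairs))
      (let ((actual (rfc2047-decode-string (cdr pair))))
        (unless (equal actual (car pair))
          (push (list (cdr pair) (car pair) actual) failures))))
    (nreverse failures)))
```

Evaluating (my-rfc2047-decoding-failures) after a change to rfc2047.el
then lists only the strings whose decoding differs from the expected
value.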

[...]
>> +\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
>> +\\)\\?="))
>
> This pattern is similar to:
>
> "Q\\?\\(\\?+[^\n=?]\\)?\\([^\n?]+\\?+[^\n=?]\\)*[^\n?]*\\?*"
>      <--------1-------><----------2,3----------><--4--><-5->
>
> 1. After "Q?", allow "?"s that are followed by a character other
>    than "=".
> 2. Allow "=" after "Q?"; it isn't regarded as the terminator.
> 3. In the middle of an encoded word, allow "?"s that are followed
>    by a character other than "=".
> 4. Allow any characters other than "?" in the middle of an
>    encoded word.
> 5. At the end, allow "?"s.

Could you please add such explanations as comments in `rfc2047.el'?

>> And we probably should have an option to toggle strict/loose
>> decoding.
>
> I've introduced the `rfc2047-allow-irregular-q-encoded-words'
> option.  I wish that it is tested widely, so I've set the default
> value to t.  But it might have to be nil when it is imported into
> the stable branch.  

I see that we already have quite a few of these variables: At least
`rfc2047-allow-irregular-q-encoded-words',
`rfc2047-allow-incomplete-encoded-text' and finally
`gnus-article-loose-mime'.

How about deriving the defaults from a single variable
`rfc2047-allow-loose-mime' (or `rfc2047-loose-mime-decoding'):

(defcustom rfc2047-allow-loose-mime t
  "Allow loose MIME decoding.  ...")
(defvar rfc2047-allow-irregular-q-encoded-words rfc2047-allow-loose-mime
  ...)
(defvar rfc2047-allow-incomplete-encoded-text rfc2047-allow-loose-mime
  ...)
(defcustom gnus-article-loose-mime rfc2047-allow-loose-mime
  "...
See also `rfc2047-allow-loose-mime'.")

Or even, if we don't need such a fine tuning, use
`rfc2047-allow-loose-mime' directly.

> Now there are two regexps; one is `rfc2047-encoded-word-regexp' for
> strict decoding, the other is `rfc2047-encoded-word-regexp-loose'.

Could you explain the purpose of using `eval-and-compile'?

> I agree with you.  I've made `rfc2047-encodable-p' use
> `rfc2047-encoded-word-regexp' instead of "=?".

The defconst needs to be before `rfc2047-encodable-p':

In rfc2047-encodable-p:
rfc2047.el:302:37:Warning: reference to free variable
    `rfc2047-encoded-word-regexp'

> It will be hard to be found out even if this change causes another
> trouble, though.

Yes.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/


* Re: More liberal MIME decoding (unencoded question marks in encoded words)
  2007-12-01 13:17       ` Reiner Steib
@ 2007-12-04  9:19         ` Katsumi Yamaoka
  0 siblings, 0 replies; 6+ messages in thread
From: Katsumi Yamaoka @ 2007-12-04  9:19 UTC (permalink / raw)
  To: ding

>>>>> Reiner Steib wrote:

>>> On Mon, Nov 26 2007, Katsumi Yamaoka wrote:
>>>> Would we be able to make complete test cases?

> I don't think providing *complete* test cases is possible.  But
> regression tests would be very nice to have.

> How about defining a variable containing a list of (decoded . encoded)
> pairs, decode/encode the strings and put the results in a buffer (or
> file) and compare them?

That's a good idea.  Since rfc2047.el handles some special cases,
such tests would serve not only for testing but also for explaining
them.  However, although I ran into various troublesome instances
while improving rfc2047.el, to my regret I lost all of them.  I'll
scan my archived articles and the functions when I have time in the
future.

>>> +\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
>>> +\\)\\?="))

[...]

> Could you please add such explanations as comments in `rfc2047.el'?

Done.

[...]

> I see that we already have quite a few of these variables: At least
> `rfc2047-allow-irregular-q-encoded-words',
> `rfc2047-allow-incomplete-encoded-text' and finally
> `gnus-article-loose-mime'.

> How about deriving the defaults from a single variable
> `rfc2047-allow-loose-mime' (or `rfc2047-loose-mime-decoding'):

> (defcustom rfc2047-allow-loose-mime t
>   "Allow loose MIME decoding.  ...")
> (defvar rfc2047-allow-irregular-q-encoded-words rfc2047-allow-loose-mime
>   ...)
> (defvar rfc2047-allow-incomplete-encoded-text rfc2047-allow-loose-mime
>   ...)
> (defcustom gnus-article-loose-mime rfc2047-allow-loose-mime
>   "...
> See also `rfc2047-allow-loose-mime'.")

> Or even, if we don't need such a fine tuning, use
> `rfc2047-allow-loose-mime' directly.

I agree with integrating those options into the single
`rfc2047-allow-loose-mime'.  However, for `gnus-article-loose-mime',
I don't think making it refer to the rfc2047- variable is proper,
because rfc2047 is not necessarily representative of the MIME
features.  Needless to say, making rfc2047.el refer to the value
of `gnus-article-loose-mime' is bad, too.  Also,
(gnus|mm|gmm)-utils.el don't seem to be a good place to define
MIME-related things.  Hmm, I have no good idea.

>> Now there are two regexps; one is `rfc2047-encoded-word-regexp' for
>> strict decoding, the other is `rfc2047-encoded-word-regexp-loose'.

> Could you explain the purpose of using `eval-and-compile'?

Because those values are constants, and the values derived from
them are hard-coded in `rfc2047-decode-region'.  Without
`eval-and-compile', Emacs complains:

Error: Symbol's value as variable is void:
 rfc2047-encoded-word-regexp-loose

There might be room for considering whether they should be
constants or not, though.
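
The situation can be sketched as follows, assuming the derived values
are computed at compile time with something like `eval-when-compile'.
All names here are hypothetical and the regexp is a simplified stand-in,
so this mirrors the pattern described above and not the actual
rfc2047.el source.

```elisp
;; Sketch with hypothetical names, not the actual rfc2047.el code.
(eval-and-compile
  ;; Evaluated both when the file is byte-compiled and when it is loaded,
  ;; so the value already exists while the compiler processes later forms.
  (defconst my-word-regexp "=\\?[^?]+\\?[BQ]\\?[^?]*\\?="
    "Simplified stand-in for an encoded-word regexp."))

(defun my-whole-word-p (string)
  "Return non-nil if STRING is exactly one (simplified) encoded word."
  ;; `eval-when-compile' computes the concatenation at compile time and
  ;; embeds the result as a constant.  That needs `my-word-regexp' to
  ;; have a value in the compiler's environment, which a bare top-level
  ;; `defconst' would not provide.
  (let ((case-fold-search t))
    (and (string-match-p
          (eval-when-compile (concat "\\`" my-word-regexp "\\'"))
          string)
         t)))
```

If the `defconst' were not wrapped in `eval-and-compile', byte-compiling
such a file in a fresh Emacs would be expected to fail with the same
kind of void-variable error as shown above.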

> The defconst needs to be before `rfc2047-encodable-p':

> In rfc2047-encodable-p:
> rfc2047.el:302:37:Warning: reference to free variable
>     `rfc2047-encoded-word-regexp'

Fixed.  I didn't notice it until I tried compiling rfc2047.el
individually.  Thanks.

