From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/65789
Path: news.gmane.org!not-for-mail
From: Katsumi Yamaoka <yamaoka@jpl.org>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: More liberal MIME decoding (unencoded question marks in encoded words)
Date: Tue, 27 Nov 2007 18:34:12 +0900
Organization: Emacsen advocacy group
Message-ID: <b4m7ik4jaln.fsf@jpl.org>
References: <v963zrepqc.fsf@marauder.physik.uni-ulm.de>
 <b4m3autchn1.fsf@jpl.org> <v9y7ckoe1t.fsf@marauder.physik.uni-ulm.de>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1196174944 22028 80.91.229.12 (27 Nov 2007 14:49:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 27 Nov 2007 14:49:04 +0000 (UTC)
To: ding@gnus.org
Original-X-From: ding-owner+M14285@lists.math.uh.edu Tue Nov 27 15:49:02 2007
Return-path: <ding-owner+M14285@lists.math.uh.edu>
Envelope-to: ding-account@gmane.org
Original-Received: from util0.math.uh.edu ([129.7.128.18])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1Ix1jr-0002Cl-Oh
	for ding-account@gmane.org; Tue, 27 Nov 2007 15:48:40 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.math.uh.edu)
	by util0.math.uh.edu with smtp (Exim 4.63)
	(envelope-from <ding-owner+M14285@lists.math.uh.edu>)
	id 1Ix1jY-0007q8-NS; Tue, 27 Nov 2007 08:48:20 -0600
Original-Received: from mx1.math.uh.edu ([129.7.128.32])
	by util0.math.uh.edu with esmtps (TLSv1:AES256-SHA:256)
	(Exim 4.63)
	(envelope-from <yamaoka@jpl.org>)
	id 1Ix1jX-0007pl-2m
	for ding@lists.math.uh.edu; Tue, 27 Nov 2007 08:48:19 -0600
Original-Received: from quimby.gnus.org ([80.91.231.51])
	by mx1.math.uh.edu with esmtp (Exim 4.67)
	(envelope-from <yamaoka@jpl.org>)
	id 1Ix1jL-0003Vj-TO
	for ding@lists.math.uh.edu; Tue, 27 Nov 2007 08:48:18 -0600
Original-Received: from orlando.hostforweb.net ([216.246.45.90])
	by quimby.gnus.org with esmtp (Exim 3.35 #1 (Debian))
	id 1IwwpF-0005PS-00
	for <ding@gnus.org>; Tue, 27 Nov 2007 10:33:53 +0100
Original-Received: from [66.225.201.151] (port=51617 helo=mail.jpl.org)
	by orlando.hostforweb.net with esmtpa (Exim 4.68)
	(envelope-from <yamaoka@jpl.org>)
	id 1Iwwpi-0002r7-Gi
	for ding@gnus.org; Tue, 27 Nov 2007 03:34:23 -0600
X-Hashcash: 1:20:071127:ding@gnus.org::k+WPh1QG+hk+macQ:00000t/h
X-Face: #kKnN,xUnmKia.'[pp`;Omh}odZK)?7wQSl"4o04=EixTF+V[""w~iNbM9ZL+.b*_CxUmFk
 B#Fu[*?MZZH@IkN:!"\w%I_zt>[$nm7nQosZ<3eu;B:$Q_:p!',P.c0-_Cy[dz4oIpw0ESA^D*1Lw=
 L&i*6&(
User-Agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.50 (gnu/linux)
Cancel-Lock: sha1:AXhfK0u1yAIYvxqmZQETf2HG6Lc=
X-Antivirus-Scanner: Clean mail though you should still use an Antivirus
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - orlando.hostforweb.net
X-AntiAbuse: Original Domain - gnus.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - jpl.org
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Spam-Score: -2.4 (--)
List-ID: <ding.gnus.org>
Precedence: bulk
Xref: news.gmane.org gmane.emacs.gnus.general:65789
Archived-At: <http://permalink.gmane.org/gmane.emacs.gnus.general/65789>

I've installed the new ones in the Gnus trunk.  Decoding of badly
Q-encoded words is enabled by default.

>>>>> Reiner Steib wrote:
> On Mon, Nov 26 2007, Katsumi Yamaoka wrote:

>> Would we be able to make complete test cases?
>>
>> (rfc2047-decode-string "=?ISO-8859-1?Q??foo?=")
>> "?foo"

[...]

> Do you see the other examples often in the wild?

No, I've never seen any such examples, though I always examine the
raw data when decoding fails.  What I have seen is mainly broken B
encoding (99.9% of Japanese MIME messages use B encoding).

> If not, I'd rather not make the decode too liberal.

I thought it wasn't going too far, since it doesn't support encoded
words folded across two or more lines.  In fact, there is a reason
I didn't make it support newlines in encoded words: because the
regexp pattern for Q encoding is ambiguous in a sense, supporting
newlines might cause re-search to get stuck on an encoded word
that is not terminated with "?=".

FYI:

> +\\(B\\?[+/0-9A-Za-z]*=*\

This pattern is restricted to only the characters that B encoding
uses, since the base64 decoder doesn't work with data containing
other characters.
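
As a minimal sketch of that restriction (not the actual Gnus code;
`my-decode-b-text' is a hypothetical name), the text part of a B
encoded word can be checked against the same character class before
it is handed to the base64 decoder:

```elisp
;; Hypothetical helper, not part of rfc2047.el.
(defun my-decode-b-text (text)
  "Return the decoded bytes of TEXT, or nil if TEXT is not clean base64."
  (and
   ;; Same character class as the B part of the regexp, anchored here
   ;; so that the whole of TEXT has to consist of base64 characters.
   (string-match-p "\\`[+/0-9A-Za-z]*=*\\'" text)
   ;; The decoder can still reject data of a bad length, so trap that.
   (condition-case nil
       (base64-decode-string text)
     (error nil))))
```

With this, "Zm9v" decodes to "foo", while "Zm9v!" is rejected by the
pattern before the decoder ever sees it.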

> +\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
> +\\)\\?="))

This pattern is similar to:

"Q\\?\\(\\?+[^\n=?]\\)?\\([^\n?]+\\?+[^\n=?]\\)*[^\n?]*\\?*"
     <--------1-------><----------2,3----------><--4--><-5->

1. After "Q?", allow "?"s that are followed by a character other
   than "=".
2. Allow "=" after "Q?"; it isn't regarded as the terminator.
3. In the middle of an encoded word, allow "?"s that are followed
   by a character other than "=".
4. Allow any characters other than "?" in the middle of an
   encoded word.
5. At the end, allow "?"s.
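
The five parts can be tried out with a rough sketch like the one
below.  It uses the simplified Q pattern above, not the regexp that
was installed.  The names `my-loose-q-regexp' and `my-loose-q-word-p',
the form of the charset prefix, and the anchoring to the whole string
are all assumptions made for illustration.

```elisp
;; Sketch only; these names are hypothetical, not part of rfc2047.el.
(defconst my-loose-q-regexp
  (concat "\\`=\\?[-0-9A-Za-z_]+\\?"     ; "=?charset?" (assumed form)
          "Q\\?"
          "\\(\\?+[^\n=?]\\)?"            ; 1: "?"s, then a non-"=" char
          "\\([^\n?]+\\?+[^\n=?]\\)*"     ; 2,3: text, "?"s, non-"=" char
          "[^\n?]*"                       ; 4: anything other than "?"
          "\\?*"                          ; 5: trailing "?"s
          "\\?=\\'")                      ; the "?=" terminator
  "Simplified loose pattern for a single Q encoded word.")

(defun my-loose-q-word-p (string)
  "Return t if STRING as a whole matches `my-loose-q-regexp'."
  (and (string-match-p my-loose-q-regexp string) t))
```

Under these assumptions, "=?ISO-8859-1?Q??foo?=" and
"=?ISO-8859-1?Q?foo??=" both match, while a word containing a
newline, such as "=?ISO-8859-1?Q?foo\nbar?=", does not.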

> And we probably should have an option to toggle strict/loose
> decoding.

I've introduced the `rfc2047-allow-irregular-q-encoded-words'
option.  I would like it to be tested widely, so I've set the
default value to t.  But it may have to be nil when it is merged
into the stable branch.  Now there are two regexps; one is
`rfc2047-encoded-word-regexp' for strict decoding, the other is
`rfc2047-encoded-word-regexp-loose'.
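
For testing, the two modes can be compared by binding the option
around a call to the decoder.  This is only a sketch: the helper
`my-decode-q' is a hypothetical name, and it assumes a Gnus new
enough to have the option.

```elisp
(require 'rfc2047)

;; Hypothetical helper for comparing strict and loose decoding.
(defun my-decode-q (string loose)
  "Decode STRING; allow irregular Q encoded words only if LOOSE is non-nil."
  (let ((rfc2047-allow-irregular-q-encoded-words loose))
    (rfc2047-decode-string string)))
```

Calling (my-decode-q "=?ISO-8859-1?Q??foo?=" t) and then the same
form with nil shows the difference between the two regexps on one
input.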

> BTW, another problem is that we "double encode"
> (`rfc2047-encode-encoded-words') such subjects:

> ELISP> (rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
> "=?ISO-8859-1?Q?foo??="
> ELISP> (rfc2047-encode-string "=?ISO-8859-1?Q?foo??=")
> "=?us-ascii?Q?=3D=3FISO-8859-1=3FQ=3Ffoo=3F=3F=3D?="

> AFAICS, Gnus (`rfc2047-encodable-p'?) simply looks for "=?".

[...]

> ..., i.e. shouldn't we use "=\\?.+\\?[qb]\\?.+\\?=" (or similar)
> instead of "=?"?

I agree with you.  I've made `rfc2047-encodable-p' use
`rfc2047-encoded-word-regexp' instead of "=?".  If this change
causes some other trouble, though, it will be hard to notice.
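
The difference between the two checks can be sketched as follows.
This is not the code of `rfc2047-encodable-p'; the function names are
hypothetical, and the regexp is the rough one suggested in the quoted
text rather than `rfc2047-encoded-word-regexp' itself.

```elisp
;; Sketch only; these names are hypothetical, not part of rfc2047.el.
(defconst my-encoded-word-like-regexp "=\\?.+\\?[qb]\\?.+\\?="
  "Rough pattern from the quoted text; not the actual Gnus regexp.")

(defun my-contains-marker-p (string)
  "Return t if STRING contains the literal two characters \"=?\"."
  (and (string-match-p (regexp-quote "=?") string) t))

(defun my-contains-encoded-word-p (string)
  "Return t if STRING contains something shaped like an encoded word."
  (let ((case-fold-search t))           ; let [qb] match "Q" and "B" too
    (and (string-match-p my-encoded-word-like-regexp string) t)))
```

A string such as "a =? b" passes the first check but not the second,
so the stricter check would no longer treat it as containing an
encoded word.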

Regards,