From: Katsumi Yamaoka
Newsgroups: gmane.emacs.gnus.general
Subject: Re: More liberal MIME decoding (unencoded question marks in encoded words)
Date: Tue, 27 Nov 2007 18:34:12 +0900
Organization: Emacsen advocacy group
To: ding@gnus.org
User-Agent: Gnus/5.110007 (No Gnus v0.7) Emacs/23.0.50 (gnu/linux)

I've installed the new ones in the Gnus trunk.  Decoding bad Q
encoding is enabled by default.

>>>>> Reiner Steib wrote:

> On Mon, Nov 26 2007, Katsumi Yamaoka wrote:
>> Would we be able to make complete test cases?
>>
>> (rfc2047-decode-string "=?ISO-8859-1?Q??foo?=")
>> "?foo"
[...]
> Do you see the other examples often in the wild?

No, I've never seen any like those, though I always examine the raw
data when decoding fails.  What I have seen is mainly broken B
encoding (99.9% of Japanese MIME messages use B encoding).

> If not, I'd rather not make the decode too liberal.

I don't think it goes too far, since it doesn't support encoded words
folded across two or more lines.  In fact, there is a reason I didn't
make it support newlines in encoded words: the regexp pattern for Q
encoding is somewhat ambiguous, so if it allowed newlines, re-search
might get stuck on an encoded word that is not terminated with "?=".

FYI:

> +\\(B\\?[+/0-9A-Za-z]*=*\

This pattern is restricted to the characters that B encoding uses,
since the base64 decoder doesn't work with data containing other
characters.
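To illustrate that restriction, here is a minimal sketch.  The helper
`my-b-payload-p' is hypothetical and is not part of rfc2047.el; it only
reuses the character class from the B part of the pattern quoted above.

```elisp
;; Hypothetical helper for illustration; not defined in rfc2047.el.
(defun my-b-payload-p (payload)
  "Return t if PAYLOAD has only base64 characters plus trailing \"=\" padding."
  ;; Same character class as the B alternative in the quoted patch,
  ;; anchored to the whole string.
  (and (string-match-p "\\`[+/0-9A-Za-z]*=*\\'" payload) t))

(my-b-payload-p "Zm9v")       ; => t
(my-b-payload-p "Zm9vYg==")   ; => t
(my-b-payload-p "Zm9v?YmFy")  ; => nil, "?" is outside the base64 alphabet
```

A payload that fails this check is simply not treated as a B encoded
word, so it never reaches the base64 decoder.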
> +\\|Q\\?\\(?:\\?+[ -<>@-~]\\)?\\(?:[ ->@-~]+\\?+[ -<>@-~]\\)*[ ->@-~]*\\?*\
> +\\)\\?="))

This pattern is similar to:

"Q\\?\\(\\?+[^\n=?]\\)?\\([^\n?]+\\?+[^\n=?]\\)*[^\n?]*\\?*"
     <--------1-------><----------2,3----------><--4--><-5->

1. After "Q?", allow "?"s that are followed by a character other
   than "=".
2. Allow "=" after "Q?"; it isn't regarded as the terminator.
3. In the middle of an encoded word, allow "?"s that are followed by
   a character other than "=".
4. Allow any characters other than "?" in the middle of an encoded
   word.
5. At the end, allow "?"s.

> And we probably should have an option to toggle strict/loose
> decoding.

I've introduced the `rfc2047-allow-irregular-q-encoded-words' option.
I'd like it to be tested widely, so I've set the default value to t.
It might have to be nil when it is imported into the stable branch,
though.  Now there are two regexps: one is
`rfc2047-encoded-word-regexp' for strict decoding, the other is
`rfc2047-encoded-word-regexp-loose'.

> BTW, another problem is that we "double encode"
> (`rfc2047-encode-encoded-words') such subjects:

ELISP> (rfc2047-decode-string "=?ISO-8859-1?Q?foo??=")
> "=?ISO-8859-1?Q?foo??="
ELISP> (rfc2047-encode-string "=?ISO-8859-1?Q?foo??=")
> "=?us-ascii?Q?=3D=3FISO-8859-1=3FQ=3Ffoo=3F=3F=3D?="

> AFAICS, Gnus (`rfc2047-encodable-p'?) simply looks for "=?".
[...]
> ..., i.e. shouldn't we use "=\\?.+\\?[qb]\\?.+\\?=" (or similar)
> instead of "=?"?

I agree with you.  I've made `rfc2047-encodable-p' use
`rfc2047-encoded-word-regexp' instead of "=?".  If this change causes
other trouble, it will be hard to track down, though.

Regards,
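
P.S. For anyone who wants to experiment with the simplified Q pattern
described earlier in this message, here is a rough sketch.  The names
`my-loose-q-word-regexp' and `my-loose-q-word-p' are hypothetical, the
charset part is deliberately simplified, and this is not the regexp
that was installed in rfc2047.el.

```elisp
;; Hypothetical sketch; these names are not defined in rfc2047.el.
(defconst my-loose-q-word-regexp
  (concat "\\`=\\?[^?\n]+\\?"          ; "=?charset?" (simplified assumption)
          "Q\\?"
          "\\(\\?+[^\n=?]\\)?"         ; 1: "?"s right after "Q?", not before "="
          "\\([^\n?]+\\?+[^\n=?]\\)*"  ; 2,3: text (may start with "="), then "?"s
          "[^\n?]*"                    ; 4: any characters other than "?"
          "\\?*"                       ; 5: trailing "?"s
          "\\?=\\'")                   ; the real terminator
  "Simplified loose pattern for a single Q encoded word.")

(defun my-loose-q-word-p (string)
  "Return t if STRING as a whole looks like a loosely valid Q encoded word."
  (let ((case-fold-search t))          ; accept "q" as well as "Q"
    (and (string-match-p my-loose-q-word-regexp string) t)))

(my-loose-q-word-p "=?ISO-8859-1?Q??foo?=")  ; => t
(my-loose-q-word-p "=?ISO-8859-1?Q?foo??=")  ; => t
(my-loose-q-word-p "=?ISO-8859-1?Q?foo")     ; => nil, no "?=" terminator
```

Because no part of the pattern matches a newline, a word that is not
terminated with "?=" fails within its own line instead of letting the
search run on into the following lines.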