From: Katsumi Yamaoka
Newsgroups: gmane.emacs.gnus.general
Subject: Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
Date: Sat, 15 Oct 2005 01:51:00 +0900
Original-To: Kenichi Handa
Cc: rms@gnu.org, bsam@ipt.ru, ding@gnus.org
User-Agent: Gnus/5.110004 (No Gnus v0.4) Emacs/22.0.50 (gnu/linux)

I added the ding list to Cc:.

>>>>> In Handa-san wrote:

> Ok.  Yamaoka-san, it seems that you are the last modifier of
> the relevant part of rfc2047.el, so I include you in CC:.

I had improved only the encoder in rfc2047.el, but that's ok.

>>>>> In
>>>>> "Boris B. Samorodov" wrote:

>> The bug appeared to be at illegal concatenation of
>> =?UTF-8? =?UTF-8? parts of the Subject.

> Yes, the bug is in the way of handling those parts.
> The current code does this:

> (1) Remove spaces between encoded words.
> (2) Decode content-transfer-encoding of the first encoded word
>     and decode the resulting text by utf-8, then decode
>     content-transfer-encoding of the second encoded word and
>     decode the resulting text by utf-8.

> But it doesn't work if the first and second encoded words are
> divided not at a character boundary of utf-8.  The above case
> is this.

I see.  The sample that Boris B. Samorodov first brought up on the
pretest-bug list gives the following result:

(prin1
 (rfc2047-decode-string
  "=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?="))
"[ipt.ru #163] АвтоОтвет: МСК: С\xc3\x90\xc2\x9f тест"

And FLIM's decoder does the same.
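
To illustrate the failure on this particular Subject, here is a minimal
sketch.  It assumes a reasonably recent Emacs and uses only the built-in
base64-decode-string and decode-coding-string.  The demo-* names are
hypothetical; they are not part of rfc2047.el or of any patch in this
thread.  The two payloads are decoded once word by word and once after
joining their raw bytes.  Only the second form can recover the character
whose two UTF-8 bytes (0xD0 0x9F) are split across the encoded words.

```elisp
;;; -*- lexical-binding: t; coding: utf-8 -*-

;; Base64 payloads of the two encoded words in the sample Subject.
;; The first payload ends with byte 0xD0, the second starts with 0x9F.
(defconst demo-words
  '("W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"
    "nyDRgtC10YHRgg=="))

(defun demo-decode-per-word (words)
  "Decode each base64 payload in WORDS as UTF-8 on its own, then join.
This mirrors the per-word behaviour described above (hypothetical helper)."
  (mapconcat (lambda (w)
               (decode-coding-string (base64-decode-string w) 'utf-8))
             words ""))

(defun demo-decode-joined (words)
  "Join the raw bytes of all WORDS first, then decode them as UTF-8 once.
This assumes every word uses the same charset (hypothetical helper)."
  (decode-coding-string
   ;; `concat' of unibyte strings stays unibyte, so no byte is reinterpreted.
   (apply #'concat (mapcar #'base64-decode-string words))
   'utf-8))
```

Evaluating (demo-decode-joined demo-words) should give the intact subject
ending in "СП тест", while (demo-decode-per-word demo-words) leaves two
undecodable bytes where "П" belongs.  How those stray bytes are displayed
depends on the Emacs version.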

> So what we should do is:

> (2') Decode content-transfer-encoding of the first and second
>      encoded words while keeping information of the coding
>      system (utf-8 in this case) on each part.  Then decode the
>      text encoding of a run that has the same coding system at
>      once.

> I'll attach a sample patch for doing that.  I modified
> rfc2047-decode-string with a helper function
> rfc2047-decode-cte.

> As I don't know the details of rfc2047, I have not yet
> installed it.  Could you please check the code and install
> it (or a version that does a similar thing).

Thank you for the patch, but I'm not sure whether dividing encoded
words in that way is legitimate.  I need time to look into it.

> *** rfc2047.el	08 Aug 2005 10:13:38 +0900	1.22
> --- rfc2047.el	14 Oct 2005 22:16:03 +0900
> ***************
> *** 822,827 ****
> --- 822,843 ----
>   ;; and worthwhile (is it more correct or not?), e.g. something like
>   ;; `=?iso-8859-1?q?foo?=@'.
>
> + (defun rfc2047-decode-cte (charset encoding word)
> +   "Decode content-transfer-encoding of WORD by ENCODING.
> + Put text property `coding' to the decoded word with value a coding system
> + derived from CHARSET."
> +   (cond ((char-equal ?B encoding)
> +          (setq word (base64-decode-string (rfc2047-pad-base64 word))))
> +         ((char-equal ?Q encoding)
> +          (setq word (quoted-printable-decode-string
> +                      (mm-subst-char-in-string ?_ ?\s word t))))
> +         (t (error "Invalid encoding: %c" encoding)))
> +   (setq word (string-to-multibyte word))
> +   (setq charset (intern (downcase charset)))
> +   (put-text-property 0 (length word)
> +                      'coding (mm-charset-to-coding-system charset) word)
> +   word)
> +
>   (defun rfc2047-decode-region (start end)
>     "Decode MIME-encoded words in region between START and END."
>     (interactive "r")
> ***************
> *** 842,857 ****
>               ;; Decode the encoded words.
>               (setq b (goto-char (point-min)))
>               (while (re-search-forward rfc2047-encoded-word-regexp nil t)
>                 (setq e (match-beginning 0))
> !               (insert (rfc2047-parse-and-decode
> !                        (prog1
> !                            (match-string 0)
> !                          (delete-region e (match-end 0)))))
> !               (while (looking-at rfc2047-encoded-word-regexp)
> !                 (insert (rfc2047-parse-and-decode
> !                          (prog1
> !                              (match-string 0)
> !                            (delete-region (point) (match-end 0))))))
>                 (save-restriction
>                   (narrow-to-region e (point))
>                   (goto-char e)
> --- 858,888 ----
>               ;; Decode the encoded words.
>               (setq b (goto-char (point-min)))
>               (while (re-search-forward rfc2047-encoded-word-regexp nil t)
> +               ;; At first, decode content-transfer-encoding of the
> +               ;; succeeding encoded words.
>                 (setq e (match-beginning 0))
> !               (let ((charset (match-string 1))
> !                     (encoding (char-after (match-beginning 3)))
> !                     (word (match-string 4)))
> !                 (delete-region e (match-end 0))
> !                 (insert (rfc2047-decode-cte charset encoding word))
> !                 (while (looking-at rfc2047-encoded-word-regexp)
> !                   (setq charset (match-string 1)
> !                         encoding (char-after (match-beginning 3))
> !                         word (match-string 4))
> !                   (delete-region (point) (match-end 0))
> !                   (insert (rfc2047-decode-cte charset encoding word))))
> !               ;; Then decode the text encoding.
> !               (save-restriction
> !                 (narrow-to-region e (point))
> !                 (goto-char e)
> !                 (while (not (eobp))
> !                   (let ((from (point))
> !                         (coding (get-text-property (point) 'coding)))
> !                     (goto-char (next-single-property-change from 'coding nil
> !                                                             (point-max)))
> !                     (if coding
> !                         (decode-coding-region from (point) coding)))))
>                 (save-restriction
>                   (narrow-to-region e (point))
>                   (goto-char e)