From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/61176
Path: news.gmane.org!not-for-mail
From: Katsumi Yamaoka <yamaoka@jpl.org>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
Date: Sat, 15 Oct 2005 01:51:00 +0900
Message-ID: <b4my84wf3ez.fsf@jpl.org>
References: <E1EQHq4-0002rQ-EC@fencepost.gnu.org> <E1EQPSa-0006iC-00@etlken>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: rms@gnu.org, bsam@ipt.ru, ding@gnus.org
Original-To: Kenichi Handa <handa@m17n.org>
User-Agent: Gnus/5.110004 (No Gnus v0.4) Emacs/22.0.50 (gnu/linux)
Cancel-Lock: sha1:Se8XaaQPyN/YXbMtTj9/sqDOX0o=
Precedence: bulk
Original-Sender: ding-owner@lists.math.uh.edu
Xref: news.gmane.org gmane.emacs.gnus.general:61176
Archived-At: <http://permalink.gmane.org/gmane.emacs.gnus.general/61176>

I added the ding list to Cc:.

>>>>> In <E1EQPSa-0006iC-00@etlken> Handa-san wrote:

> Ok.  Yamaoka-san, it seems that you are the last modifier of
> the relevant part of rfc2047.el, so I include you in CC:.

I have only improved the encoder in rfc2047.el, but that's OK.

>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>	"Boris B. Samorodov" <bsam@ipt.ru> wrote:

>> The bug appears to be in the illegal concatenation of the
>> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.

> Yes, the bug is in the way those parts are handled.

> The current code does this:

> (1) Remove spaces between encoded words.

> (2) Decode content-transfer-encoding of <foo> and decode the
> resulting text by utf-8, then decode
> content-transfer-encoding of <bar> and decode the resulting
> text by utf-8.

> But it doesn't work if <foo> and <bar> are divided at a point that
> is not a utf-8 character boundary.  That is the case here.

I see.  The sample that Boris B. Samorodov first reported to the
pretest-bug list gives the following result:

(prin1
 (rfc2047-decode-string
  "=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?="))
"[ipt.ru #163] АвтоОтвет: МСК: С\xc3\x90\xc2\x9f тест"

And FLIM's decoder does the same.
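
The failure can be reproduced without rfc2047.el.  The following is
only a sketch using the built-in `base64-decode-string' and
`decode-coding-string'; the `my-' names are made up for this
illustration and are not part of Gnus.  It decodes the two base64
payloads of the sample Subject, once word by word and once after
joining the raw bytes:

```elisp
;; Illustration only: the `my-' names are hypothetical, not part of Gnus.
(defconst my-sample-payloads
  '("W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"
    "nyDRgtC10YHRgg==")
  "Base64 payloads of the two encoded words in the sample Subject.")

(defun my-decode-per-word (payloads)
  "Decode each of PAYLOADS as utf-8 on its own, then concatenate.
A character split across two payloads is left as undecoded raw bytes."
  (mapconcat (lambda (payload)
               (decode-coding-string (base64-decode-string payload) 'utf-8))
             payloads ""))

(defun my-decode-joined (payloads)
  "Concatenate the raw bytes of PAYLOADS first, then decode once as utf-8."
  (decode-coding-string
   (apply #'concat (mapcar #'base64-decode-string payloads))
   'utf-8))
```

The first payload ends with the byte 0xD0 and the second begins with
0x9F, so only `my-decode-joined' sees the two bytes as one character;
`my-decode-per-word' leaves both of them undecoded.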

> So what we should do is:

> (2') Decode content-transfer-encoding of <foo> and <bar>
> while keeping information of coding system (utf-8 in this
> case) on each part.  Then decode the text encoding of a run
> that has the same coding system at once.
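
Outside of a buffer, step (2') might look roughly like this.  It is a
sketch under assumptions, not the attached patch: the name
`my-decode-runs' and the list of (CODING . BYTES) pairs are invented
here, whereas the patch records the coding system in a text property.
Each CODING is assumed to be a valid, non-nil coding system.

```elisp
;; Illustration only: `my-decode-runs' is a hypothetical helper.
(defun my-decode-runs (parts)
  "Decode PARTS, a list of (CODING . BYTES) pairs, run by run.
BYTES are unibyte strings whose content-transfer-encoding has already
been removed.  Adjacent parts with the same CODING are joined and
decoded at once, so a character split between them survives."
  (let ((result "")
        run-coding run-bytes)
    (dolist (part parts)
      (if (eq (car part) run-coding)
          ;; Same coding system as the current run: just collect bytes.
          (setq run-bytes (concat run-bytes (cdr part)))
        ;; Coding system changed: flush the finished run, start a new one.
        (when run-coding
          (setq result
                (concat result (decode-coding-string run-bytes run-coding))))
        (setq run-coding (car part)
              run-bytes (cdr part))))
    ;; Flush the last run.
    (when run-coding
      (setq result
            (concat result (decode-coding-string run-bytes run-coding))))
    result))
```

For instance, two utf-8 parts holding the bytes 0xD0 and 0x9F followed
by an iso-8859-1 part form two runs, and the first run decodes to a
single character.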

> I'll attach a sample patch for doing that.  I modified
> rfc2047-decode-string with a helper function
> rfc2047-decode-cte.

> As I don't know the details of RFC 2047, I have not yet
> installed it.  Could you please check the code and install
> it (or a version that does a similar thing)?

Thank you for the patch, but I'm not sure whether dividing
encoded words in that way is legitimate.  I need time to look into
it.

> *** rfc2047.el	08 Aug 2005 10:13:38 +0900	1.22
> --- rfc2047.el	14 Oct 2005 22:16:03 +0900
> ***************
> *** 822,827 ****
> --- 822,843 ----
>   ;; and worthwhile (is it more correct or not?), e.g. something like
>   ;; `=?iso-8859-1?q?foo?=@'.

> + (defun rfc2047-decode-cte (charset encoding word)
> +   "Decode content-transfer-encoding of WORD by ENCODING.
> + Put text property `coding' to the decoded word with value a coding system
> + derived from CHARSET."
> +   (cond ((char-equal ?B encoding)
> + 	 (setq word (base64-decode-string (rfc2047-pad-base64 word))))
> + 	((char-equal ?Q encoding)
> + 	 (setq word (quoted-printable-decode-string
> + 		     (mm-subst-char-in-string ?_ ? word t))))
> + 	(t (error "Invalid encoding: %c" encoding)))
> +   (setq word (string-to-multibyte word))
> +   (setq charset (intern (downcase charset)))
> +   (put-text-property 0 (length word)
> + 		     'coding (mm-charset-to-coding-system charset) word)
> +   word)
> +
>   (defun rfc2047-decode-region (start end)
>     "Decode MIME-encoded words in region between START and END."
>     (interactive "r")
> ***************
> *** 842,857 ****
>   	;; Decode the encoded words.
>   	(setq b (goto-char (point-min)))
>   	(while (re-search-forward rfc2047-encoded-word-regexp nil t)
>   	  (setq e (match-beginning 0))
> ! 	  (insert (rfc2047-parse-and-decode
> ! 		   (prog1
> ! 		       (match-string 0)
> ! 		     (delete-region e (match-end 0)))))
> ! 	  (while (looking-at rfc2047-encoded-word-regexp)
> ! 	    (insert (rfc2047-parse-and-decode
> ! 		     (prog1
> ! 			 (match-string 0)
> ! 		       (delete-region (point) (match-end 0))))))
>   	  (save-restriction
>   	    (narrow-to-region e (point))
>   	    (goto-char e)
> --- 858,888 ----
>   	;; Decode the encoded words.
>   	(setq b (goto-char (point-min)))
>   	(while (re-search-forward rfc2047-encoded-word-regexp nil t)
> + 	  ;; At first, decode content-transfer-encoding of the
> + 	  ;; succeeding encoded words.
>   	  (setq e (match-beginning 0))
> ! 	  (let ((charset (match-string 1))
> ! 		(encoding (char-after (match-beginning 3)))
> ! 		(word (match-string 4)))
> ! 	    (delete-region e (match-end 0))
> ! 	    (insert (rfc2047-decode-cte charset encoding word))
> ! 	    (while (looking-at rfc2047-encoded-word-regexp)
> ! 	      (setq charset (match-string 1)
> ! 		    encoding (char-after (match-beginning 3))
> ! 		    word (match-string 4))
> ! 	      (delete-region (point) (match-end 0))
> ! 	      (insert (rfc2047-decode-cte charset encoding word))))
> ! 	  ;; Then decode the text encoding.
> ! 	  (save-restriction
> ! 	    (narrow-to-region e (point))
> ! 	    (goto-char e)
> ! 	    (while (not (eobp))
> ! 	      (let ((from (point))
> ! 		    (coding (get-text-property (point) 'coding)))
> ! 		(goto-char (next-single-property-change from 'coding nil
> ! 							(point-max)))
> ! 		(if coding
> ! 		    (decode-coding-region from (point) coding)))))
>   	  (save-restriction
>   	    (narrow-to-region e (point))
>   	    (goto-char e)