Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r

Gnus development mailing list

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
       [not found] ` <E1EQPSa-0006iC-00@etlken>
@ 2005-10-14 16:51   ` Katsumi Yamaoka
  2005-10-15  0:46     ` Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-14 16:51 UTC
  Cc: rms, bsam, ding

I added the ding list to Cc:.

>>>>> In <E1EQPSa-0006iC-00@etlken> Handa-san wrote:

> Ok.  Yamaoka-san, it seems that you are the last modifier of
> the relevant part of rfc2047.el, so I include you in CC:.

I had only improved the rfc2047.el encoder, but that's OK.

>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>	"Boris B. Samorodov" <bsam@ipt.ru> wrote:

>> The bug appears to be the incorrect concatenation of the
>> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.

> Yes, the bug is in the way those parts are handled.

> The current code does this:

> (1) Remove spaces between encoded words.

> (2) Decode content-transfer-encoding of <foo> and decode the
> resulting text by utf-8, then decode
> content-transfer-encoding of <bar> and decode the resulting
> text by utf-8.

> But that doesn't work if <foo> and <bar> are divided at a point
> that is not a UTF-8 character boundary, which is the case here.

I see.  The sample that Boris B. Samorodov first posted to the
pretest-bug list gives the following result:

(prin1
 (rfc2047-decode-string
  "=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?="))
"[ipt.ru #163] АвтоОтвет: МСК: С\xc3\x90\xc2\x9f тест"

And FLIM's decoder does the same.
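
The failure can be reproduced outside Emacs.  The following is a
minimal sketch in Python (not the Gnus code; the variable names are
made up for illustration) showing that the two payloads from the
Subject above split a two-byte UTF-8 character, so decoding each
word on its own loses it while decoding the joined bytes does not:

```python
import base64

# The two encoded-text payloads from the Subject header quoted in this thread.
WORD1 = "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"
WORD2 = "nyDRgtC10YHRgg=="

raw1 = base64.b64decode(WORD1)
raw2 = base64.b64decode(WORD2)

# The first word ends with a UTF-8 lead byte and the second word
# starts with the matching continuation byte.
print(raw1[-1:], raw2[:1])

# Decoding word by word: the split character becomes replacement characters.
per_word = (raw1.decode("utf-8", errors="replace")
            + raw2.decode("utf-8", errors="replace"))

# Joining the raw bytes first and decoding once recovers the text.
joined = (raw1 + raw2).decode("utf-8")

print(per_word)
print(joined)
```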

> So what we should do is:

> (2') Decode content-transfer-encoding of <foo> and <bar>
> while keeping information of coding system (utf-8 in this
> case) on each part.  Then decode the text encoding of a run
> that has the same coding system at once.
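
Step (2') can be sketched as follows.  This is an illustrative
Python version, not the patch below; the function names and the
simplified regular expression are assumptions, and the input is
assumed to consist only of encoded-words separated by whitespace.
It first undoes the B or Q transfer encoding of every word, then
decodes each run of words sharing a charset in one go:

```python
import base64
import quopri
import re
from itertools import groupby

# Simplified encoded-word pattern: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")


def decode_cte(encoding, text):
    """Undo only the B or Q transfer encoding and return raw bytes."""
    if encoding.upper() == "B":
        return base64.b64decode(text)
    # header=True makes "_" decode to a space, as the Q encoding requires.
    return quopri.decodestring(text.encode("ascii"), header=True)


def decode_adjacent_words(header_value):
    """Decode a sequence of adjacent encoded-words, one charset run at a time."""
    parts = [(m.group(1).lower(), decode_cte(m.group(2), m.group(3)))
             for m in ENCODED_WORD.finditer(header_value)]
    decoded = []
    for charset, run in groupby(parts, key=lambda part: part[0]):
        # Join the bytes of the whole run before decoding the charset.
        decoded.append(b"".join(raw for _, raw in run).decode(charset))
    return "".join(decoded)


subject = ("=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?="
           " =?UTF-8?B?nyDRgtC10YHRgg==?=")
print(decode_adjacent_words(subject))
```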

> I'll attach a sample patch for doing that.  I modified
> rfc2047-decode-string with a helper function
> rfc2047-decode-cte.

> As I don't know the details of RFC 2047, I have not installed
> it yet.  Could you please check the code and install it (or a
> version that does something similar)?

Thank you for the patch, but I'm not sure whether splitting
encoded words in that way is legitimate.  I need time to look
into it.

> *** rfc2047.el	08 Aug 2005 10:13:38 +0900	1.22
> --- rfc2047.el	14 Oct 2005 22:16:03 +0900	
> ***************
> *** 822,827 ****
> --- 822,843 ----
>   ;; and worthwhile (is it more correct or not?), e.g. something like
>   ;; `=?iso-8859-1?q?foo?=@'.

> + (defun rfc2047-decode-cte (charset encoding word)
> +   "Decode content-transfer-encoding of WORD by ENCODING.
> + Put text property `coding' to the decoded word with value a coding system
> + derived from CHARSET."
> +   (cond ((char-equal ?B encoding)
> + 	 (setq word (base64-decode-string (rfc2047-pad-base64 word))))
> + 	((char-equal ?Q encoding)
> + 	 (setq word (quoted-printable-decode-string
> + 		     (mm-subst-char-in-string ?_ ? word t))))
> + 	(t (error "Invalid encoding: %c" encoding)))
> +   (setq word (string-to-multibyte word))
> +   (setq charset (intern (downcase charset)))
> +   (put-text-property 0 (length word) 
> + 		     'coding (mm-charset-to-coding-system charset) word)
> +   word)
> + 
>   (defun rfc2047-decode-region (start end)
>     "Decode MIME-encoded words in region between START and END."
>     (interactive "r")
> ***************
> *** 842,857 ****
>   	;; Decode the encoded words.
>   	(setq b (goto-char (point-min)))
>   	(while (re-search-forward rfc2047-encoded-word-regexp nil t)
>   	  (setq e (match-beginning 0))
> ! 	  (insert (rfc2047-parse-and-decode
> ! 		   (prog1
> ! 		       (match-string 0)
> ! 		     (delete-region e (match-end 0)))))
> ! 	  (while (looking-at rfc2047-encoded-word-regexp)
> ! 	    (insert (rfc2047-parse-and-decode
> ! 		     (prog1
> ! 			 (match-string 0)
> ! 		       (delete-region (point) (match-end 0))))))
>   	  (save-restriction
>   	    (narrow-to-region e (point))
>   	    (goto-char e)
> --- 858,888 ----
>   	;; Decode the encoded words.
>   	(setq b (goto-char (point-min)))
>   	(while (re-search-forward rfc2047-encoded-word-regexp nil t)
> + 	  ;; At first, decode content-transfer-encoding of the
> + 	  ;; succeeding encoded words.
>   	  (setq e (match-beginning 0))
> ! 	  (let ((charset (match-string 1))
> ! 		(encoding (char-after (match-beginning 3)))
> ! 		(word (match-string 4)))
> ! 	    (delete-region e (match-end 0))
> ! 	    (insert (rfc2047-decode-cte charset encoding word))
> ! 	    (while (looking-at rfc2047-encoded-word-regexp)
> ! 	      (setq charset (match-string 1)
> ! 		    encoding (char-after (match-beginning 3))
> ! 		    word (match-string 4))
> ! 	      (delete-region (point) (match-end 0))
> ! 	      (insert (rfc2047-decode-cte charset encoding word))))
> ! 	  ;; Then decode the text encoding.
> ! 	  (save-restriction
> ! 	    (narrow-to-region e (point))
> ! 	    (goto-char e)
> ! 	    (while (not (eobp))
> ! 	      (let ((from (point))
> ! 		    (coding (get-text-property (point) 'coding)))
> ! 		(goto-char (next-single-property-change from 'coding nil
> ! 							(point-max)))
> ! 		(if coding
> ! 		    (decode-coding-region from (point) coding)))))
>   	  (save-restriction
>   	    (narrow-to-region e (point))
>   	    (goto-char e)




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-14 16:51   ` gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r Katsumi Yamaoka
@ 2005-10-15  0:46     ` Kenichi Handa
  2005-10-15  8:28       ` Katsumi Yamaoka
  0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-10-15  0:46 UTC
  Cc: rms, bsam, ding

In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
>>  As I don't know the details of RFC 2047, I have not installed
>>  it yet.  Could you please check the code and install it (or a
>>  version that does something similar)?

> Thank you for the patch, but I'm not sure whether splitting
> encoded words in that way is legitimate.  I need time to look
> into it.

Thank you.

---
Kenichi Handa
handa@m17n.org




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-15  0:46     ` Kenichi Handa
@ 2005-10-15  8:28       ` Katsumi Yamaoka
  2005-10-15  8:50         ` Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-15  8:28 UTC
  Cc: rms, bsam, ding

>>>>> In <E1EQaBp-0003ve-00@etlken> Kenichi Handa wrote:

> In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
>>>  As I don't know the details of RFC 2047, I have not installed
>>>  it yet.  Could you please check the code and install it (or a
>>>  version that does something similar)?

>> Thank you for the patch, but I'm not sure whether splitting
>> encoded words in that way is legitimate.  I need time to look
>> into it.

> Thank you.

I confirmed that Handa-san's patch is 99% perfect and doesn't
degrade performance.  However, I hesitate to commit it to Gnus
because I found the following `MUST NOT' phrase in RFC 2047:

5. Use of encoded-words in message headers

[...]

   The 'encoded-text' in an 'encoded-word' must be self-contained;
   'encoded-text' MUST NOT be continued from one 'encoded-word' to
   another.  This implies that the 'encoded-text' portion of a "B"
   'encoded-word' will be a multiple of 4 characters long; for a "Q"
   'encoded-word', any "=" character that appears in the 'encoded-text'
   portion will be followed by two hexadecimal characters.

The encoded-words that Boris B. Samorodov presented fall under
exactly this case.  Even so, should Gnus support such encodings?

>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>	"Boris B. Samorodov" <bsam@ipt.ru> wrote:

> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=


* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-15  8:28       ` Katsumi Yamaoka
@ 2005-10-15  8:50         ` Kenichi Handa
  2005-10-15 10:06           ` Katsumi Yamaoka
  0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-10-15  8:50 UTC
  Cc: rms, bsam, ding, handa

In article <b4mll0vfake.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
> I confirmed that Handa-san's patch is 99% perfect and doesn't
> degrade performance.  However, I hesitate to commit it to Gnus
> because I found the following `MUST NOT' phrase in RFC 2047:

> 5. Use of encoded-words in message headers

> [...]

>    The 'encoded-text' in an 'encoded-word' must be self-contained;
>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
>    another.  This implies that the 'encoded-text' portion of a "B"
>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
>    'encoded-word', any "=" character that appears in the 'encoded-text'
>    portion will be followed by two hexadecimal characters.

> The encoded-words that Boris B. Samorodov presented fall under
> exactly this case.  Even so, should Gnus support such encodings?

>>>>>>  In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>> 	"Boris B. Samorodov" <bsam@ipt.ru> wrote:

>>  Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

This example doesn't violate the above restriction.  The
'encoded-text' of each 'encoded-word' is indeed a "multiple of 4
characters long".

Please note that the above restriction applies to the
'encoded-text', not to the underlying coded character set.
So I think the quoted passage doesn't prohibit dividing a
UTF-8 byte sequence at a non-character boundary.
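
Both properties can be checked directly.  A small Python sketch
(the helper name check_word is hypothetical) confirms that each
encoded-text from the Subject above has a length that is a multiple
of 4, while neither word decodes as complete UTF-8 on its own:

```python
import base64

# The two encoded-text payloads from the Subject header quoted in this thread.
WORD1 = "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"
WORD2 = "nyDRgtC10YHRgg=="


def check_word(encoded_text):
    """Return (length is a multiple of 4, bytes are complete UTF-8)."""
    raw = base64.b64decode(encoded_text)
    try:
        raw.decode("utf-8")
        complete = True
    except UnicodeDecodeError:
        complete = False
    return len(encoded_text) % 4 == 0, complete


for word in (WORD1, WORD2):
    print(check_word(word))
```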

---
Kenichi Handa
handa@m17n.org




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-15  8:50         ` Kenichi Handa
@ 2005-10-15 10:06           ` Katsumi Yamaoka
  2005-10-16  0:25             ` Kenichi Handa
  2005-10-18 18:20             ` Boris Samorodov
  0 siblings, 2 replies; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-15 10:06 UTC
  Cc: rms, bsam, ding

>>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:

>> 5. Use of encoded-words in message headers

>> [...]

>>    The 'encoded-text' in an 'encoded-word' must be self-contained;
>>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
>>    another.  This implies that the 'encoded-text' portion of a "B"
>>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
>>    'encoded-word', any "=" character that appears in the 'encoded-text'
>>    portion will be followed by two hexadecimal characters.

>> The encoded-words that Boris B. Samorodov presented fall under
>> exactly this case.  Even so, should Gnus support such encodings?

>>>  Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

> This example doesn't violate the above restriction.  The
> 'encoded-text' of each 'encoded-word' is indeed a "multiple of 4
> characters long".

> Please note that the above restriction applies to the
> 'encoded-text', not to the underlying coded character set.
> So I think the quoted passage doesn't prohibit dividing a
> UTF-8 byte sequence at a non-character boundary.

I agree.  Thank you for clarifying it.  I've committed your
patch to cvs.gnus.org with small modifications.  It will be
propagated to Emacs soon.




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-15 10:06           ` Katsumi Yamaoka
@ 2005-10-16  0:25             ` Kenichi Handa
  2005-10-18 18:20             ` Boris Samorodov
  1 sibling, 0 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-10-16  0:25 UTC
  Cc: rms, bsam, ding

In article <b4md5m7p003.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

[...]
> I agree.  Thank you for clarifying it.  I've committed your
> patch to cvs.gnus.org with small modifications.  It will be
> propagated to Emacs soon.

Thank you.

---
Kenichi Handa
handa@m17n.org




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-15 10:06           ` Katsumi Yamaoka
  2005-10-16  0:25             ` Kenichi Handa
@ 2005-10-18 18:20             ` Boris Samorodov
  2005-10-19  4:12               ` Katsumi Yamaoka
  1 sibling, 1 reply; 10+ messages in thread
From: Boris Samorodov @ 2005-10-18 18:20 UTC
  Cc: Kenichi Handa, rms, ding

On Sat, 15 Oct 2005 19:06:52 +0900 Katsumi Yamaoka wrote:

> >>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:

> >> 5. Use of encoded-words in message headers

> >> [...]

> >>    The 'encoded-text' in an 'encoded-word' must be self-contained;
> >>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
> >>    another.  This implies that the 'encoded-text' portion of a "B"
> >>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
> >>    'encoded-word', any "=" character that appears in the 'encoded-text'
> >>    portion will be followed by two hexadecimal characters.

> >> The encoded-words that Boris B. Samorodov presented fall under
> >> exactly this case.  Even so, should Gnus support such encodings?

> >>>  Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

> > This example doesn't violate the above restriction.  The
> > 'encoded-text' of each 'encoded-word' is indeed a "multiple of 4
> > characters long".

> > Please note that the above restriction applies to the
> > 'encoded-text', not to the underlying coded character set.
> > So I think the quoted passage doesn't prohibit dividing a
> > UTF-8 byte sequence at a non-character boundary.

> I agree.  Thank you for clarifying it.  I've committed your
> patch to cvs.gnus.org with small modifications.  It will be
> propagated to Emacs soon.


This is to confirm that the latest revision 7.43 from HEAD
for gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
fields.

Thank you all who helped to investigate and unbreak the case!

Should I confirm the success story anywhere else (maybe
bug-gnu-emacs@)?


WBR
-- 
bsam




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-18 18:20             ` Boris Samorodov
@ 2005-10-19  4:12               ` Katsumi Yamaoka
  2005-10-19 20:16                 ` Richard M. Stallman
  0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-19  4:12 UTC
  Cc: Kenichi Handa, rms, ding

>>>>> In <20421354@serv3.int.kfs.ru> Boris Samorodov wrote:

> This is to confirm that the latest revision 7.43 from HEAD
> for gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
> fields.

> Thank you all who helped to investigate and unbreak the case!

You're welcome.  After discussing this with well-informed people
in Japan, we came to recognize that such an encoding (dividing
the encoded text somewhere other than at a character boundary)
violates RFC 2047, section 5.  Here's an extract:

   Each 'encoded-word' MUST represent an integral number of characters.
   A multi-octet character may not be split across adjacent 'encoded-
   word's.

That doesn't prohibit trying to decode them, though, and it
would be nice for Gnus to do so.

> Should I confirm the success story anywhere else (maybe
> bug-gnu-emacs@)?

That's probably not necessary.

BTW, I realized that the fix was insufficient.  For instance,
it will display binary garbage if the charset specified in the
encoded-word is unknown.  So I will commit new code to CVS
after a while.
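
One possible way to avoid the garbage, sketched here in Python
purely as an assumption (this is not the code that was committed to
Gnus, and decode_b_word is a hypothetical name), is to leave an
encoded-word untouched when its charset is not recognized:

```python
import base64
import codecs


def decode_b_word(charset, encoded_text):
    """Decode one B-encoded word, or return it unchanged if the charset is unknown."""
    try:
        codecs.lookup(charset)
    except LookupError:
        # Unknown charset: show the original encoded-word rather than raw bytes.
        return f"=?{charset}?B?{encoded_text}?="
    return base64.b64decode(encoded_text).decode(charset, errors="replace")


print(decode_b_word("utf-8", "Zm9v"))
print(decode_b_word("x-unknown-charset", "Zm9v"))
```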

Regards,


* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
  2005-10-19  4:12               ` Katsumi Yamaoka
@ 2005-10-19 20:16                 ` Richard M. Stallman
  0 siblings, 0 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-19 20:16 UTC
  Cc: bsam, handa, ding

       Each 'encoded-word' MUST represent an integral number of characters.
       A multi-octet character may not be split across adjacent 'encoded-
       word's.

    That doesn't prohibit trying to decode them, though, and it
    would be nice for Gnus to do so.

I agree.  Maybe mailers should not generate this,
but if they do, it is better for Gnus to handle it right.

    BTW, I realized that the fix was insufficient.  For instance,
    it will display binary garbage if the charset specified in the
    encoded-word is unknown.  So I will commit new code to CVS
    after a while.

Thank you in advance.




* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
       [not found] <E1EQ6NU-000GJF-2s@bsam.ru>
@ 2005-10-13 18:26 ` Reiner Steib
  0 siblings, 0 replies; 10+ messages in thread
From: Reiner Steib @ 2005-10-13 18:26 UTC
  Cc: emacs-pretest-bug, Ding List

On Thu, Oct 13 2005, Boris B. Samorodov wrote:

[ On emacs-pretest.  Cc-ing Ding ]

> Symptoms:
>
> I have a message with the following Subject:
> -----
> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
> -----
>
> On the command line I can run...
>
> $ echo "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQnyDRgtC10YHRgg==" | base64 -d | iconv -f utf-8
>
> ...and get the result:
>
> [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> But Gnus (from CVS, like Emacs) shows the following line...
>
> Subject: [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> ...which is wrong.

I don't see any difference.  Maybe I'm misunderstanding what you mean.

> The bug appears to be the incorrect concatenation of the
> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.

Whitespace between adjacent encoded words has to be ignored according
to RFC 2047:

,----[ rfc2047.txt ]
|    (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)     (ab)
| 
|            White space between adjacent 'encoded-word's is not
|            displayed.
| 
|    (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)    (ab)
| 
|         Even multiple SPACEs between 'encoded-word's are ignored
|         for the purpose of display.
| 
|    (=?ISO-8859-1?Q?a?=                         (ab)
|        =?ISO-8859-1?Q?b?=)
| 
|            Any amount of linear-space-white between 'encoded-word's,
|            even if it includes a CRLF followed by one or more SPACEs,
|            is ignored for the purposes of display.
`----
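
The rule quoted above can be sketched in Python as follows.  This
is only an illustration with a simplified encoded-word pattern and
a hypothetical function name; it removes whitespace that sits
between two encoded-words and leaves all other whitespace alone:

```python
import re

# Simplified encoded-word pattern: =?charset?encoding?encoded-text?=
ENCODED_WORD = r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?="

# Whitespace (including folded CRLF + spaces) that follows one
# encoded-word and is immediately followed by another one.
BETWEEN_WORDS = re.compile(rf"({ENCODED_WORD})\s+(?={ENCODED_WORD})")


def drop_inter_word_space(header_value):
    """Remove whitespace between adjacent encoded-words only."""
    return BETWEEN_WORDS.sub(r"\1", header_value)


print(drop_inter_word_space("(=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)"))
print(drop_inter_word_space("(=?ISO-8859-1?Q?a?= b)"))
```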

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

