* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
[not found] ` <E1EQPSa-0006iC-00@etlken>
@ 2005-10-14 16:51 ` Katsumi Yamaoka
2005-10-15 0:46 ` Kenichi Handa
0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-14 16:51 UTC (permalink / raw)
Cc: rms, bsam, ding
I added the ding list to Cc:.
>>>>> In <E1EQPSa-0006iC-00@etlken> Handa-san wrote:
> Ok. Yamaoka-san, it seems that you are the last modifier of
> the relevant part of rfc2047.el, so I include you in CC:.
I'd improved only the rfc2047.el encoder, but that's ok.
>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:
>> The bug appeared to be at illegal concatenation of
>> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.
> Yes, the bug is in the way those parts are handled.
> The current code does this:
> (1) Remove spaces between encoded words.
> (2) Decode content-transfer-encoding of <foo> and decode the
> resulting text by utf-8, then decode
> content-transfer-encoding of <bar> and decode the resulting
> text by utf-8.
> But it doesn't work if <foo> and <bar> are divided at a point that is
> not a utf-8 character boundary, which is the case above.
I see. The sample that Boris B. Samorodov brought up to the
pretest-bug list first gives the following result:
(prin1
 (rfc2047-decode-string
  "=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?="))
"[ipt.ru #163] АвтоОтвет: МСК: С\xc3\x90\xc2\x9f тест"
And FLIM's decoder does the same.
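The failure can be reproduced outside Gnus. The following is a minimal sketch, not code from rfc2047.el: the `sketch-' names are made up for this illustration, and only the two Base64 payloads are taken from the Subject above. It compares decoding each payload as UTF-8 separately with decoding the concatenated bytes once.

```elisp
;; Hypothetical names; the payloads are the two encoded-text parts
;; of the Subject quoted above.
(defvar sketch-foo
  (base64-decode-string
   "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ")
  "Raw bytes of the first encoded-word; the last byte is #xD0.")

(defvar sketch-bar
  (base64-decode-string "nyDRgtC10YHRgg==")
  "Raw bytes of the second encoded-word; the first byte is #x9F.")

(defun sketch-decode-per-word ()
  "Decode each payload as UTF-8 on its own, then join the results.
The two-byte character #xD0 #x9F is split between the payloads,
so neither half can be decoded."
  (concat (decode-coding-string sketch-foo 'utf-8)
          (decode-coding-string sketch-bar 'utf-8)))

(defun sketch-decode-joined ()
  "Join the raw bytes first, then decode as UTF-8 once."
  (decode-coding-string (concat sketch-foo sketch-bar) 'utf-8))
```

Evaluating `(sketch-decode-joined)` should give the readable Subject, while `(sketch-decode-per-word)` leaves undecodable bytes where the split character was.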
> So what we should do is:
> (2') Decode content-transfer-encoding of <foo> and <bar>
> while keeping information of coding system (utf-8 in this
> case) on each part. Then decode the text encoding of a run
> that has the same coding system at once.
> I'll attach a sample patch for doing that. I modified
> rfc2047-decode-string with a helper function
> rfc2047-decode-cte.
> As I don't know the detail of rfc2047, I have not yet
> installed it. Could you please check the code and install
> it (or a version that does the similar thing).
Thank you for the patch, but I'm not sure whether dividing of
encoded words in that way is rightful. I need time to look into
it.
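Step (2') can also be sketched on plain strings, independently of the buffer-based patch below. This is only an illustration under stated assumptions: `sketch-decode-runs' is a hypothetical helper, and its input is assumed to be a list of (CODING . BYTES) pairs whose content-transfer-encoding has already been removed.

```elisp
(defun sketch-decode-runs (parts)
  "Decode PARTS, a list of (CODING . BYTES) pairs in header order.
Adjacent BYTES that share the same CODING are concatenated into one
run, and each run is decoded once.  Return the joined text."
  (let ((result "")
        run-coding run-bytes)
    (dolist (part parts)
      (if (eq (car part) run-coding)
          ;; Same coding system as the current run: just append bytes.
          (setq run-bytes (concat run-bytes (cdr part)))
        ;; Coding system changed: flush the finished run, start a new one.
        (when run-coding
          (setq result
                (concat result
                        (decode-coding-string run-bytes run-coding))))
        (setq run-coding (car part)
              run-bytes (cdr part))))
    ;; Flush the last run, if any.
    (when run-coding
      (setq result
            (concat result (decode-coding-string run-bytes run-coding))))
    result))
```

With the two payloads of the reported Subject, both tagged `utf-8', the split character is reassembled before decoding:

```elisp
(sketch-decode-runs
 (list (cons 'utf-8 (base64-decode-string
                     "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"))
       (cons 'utf-8 (base64-decode-string "nyDRgtC10YHRgg=="))))
```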
> *** rfc2047.el 08 Aug 2005 10:13:38 +0900 1.22
> --- rfc2047.el 14 Oct 2005 22:16:03 +0900
> ***************
> *** 822,827 ****
> --- 822,843 ----
> ;; and worthwhile (is it more correct or not?), e.g. something like
> ;; `=?iso-8859-1?q?foo?=@'.
> + (defun rfc2047-decode-cte (charset encoding word)
> +   "Decode content-transfer-encoding of WORD by ENCODING.
> + Put text property `coding' to the decoded word with value a coding system
> + derived from CHARSET."
> +   (cond ((char-equal ?B encoding)
> +          (setq word (base64-decode-string (rfc2047-pad-base64 word))))
> +         ((char-equal ?Q encoding)
> +          (setq word (quoted-printable-decode-string
> +                      (mm-subst-char-in-string ?_ ? word t))))
> +         (t (error "Invalid encoding: %c" encoding)))
> +   (setq word (string-to-multibyte word))
> +   (setq charset (intern (downcase charset)))
> +   (put-text-property 0 (length word)
> +                      'coding (mm-charset-to-coding-system charset) word)
> +   word)
> +
> (defun rfc2047-decode-region (start end)
> "Decode MIME-encoded words in region between START and END."
> (interactive "r")
> ***************
> *** 842,857 ****
> ;; Decode the encoded words.
> (setq b (goto-char (point-min)))
> (while (re-search-forward rfc2047-encoded-word-regexp nil t)
> (setq e (match-beginning 0))
> ! (insert (rfc2047-parse-and-decode
> ! (prog1
> ! (match-string 0)
> ! (delete-region e (match-end 0)))))
> ! (while (looking-at rfc2047-encoded-word-regexp)
> ! (insert (rfc2047-parse-and-decode
> ! (prog1
> ! (match-string 0)
> ! (delete-region (point) (match-end 0))))))
> (save-restriction
> (narrow-to-region e (point))
> (goto-char e)
> --- 858,888 ----
> ;; Decode the encoded words.
> (setq b (goto-char (point-min)))
> (while (re-search-forward rfc2047-encoded-word-regexp nil t)
> + ;; At first, decode content-transfer-encoding of the
> + ;; succeeding encoded words.
> (setq e (match-beginning 0))
> ! (let ((charset (match-string 1))
> ! (encoding (char-after (match-beginning 3)))
> ! (word (match-string 4)))
> ! (delete-region e (match-end 0))
> ! (insert (rfc2047-decode-cte charset encoding word))
> ! (while (looking-at rfc2047-encoded-word-regexp)
> ! (setq charset (match-string 1)
> ! encoding (char-after (match-beginning 3))
> ! word (match-string 4))
> ! (delete-region (point) (match-end 0))
> ! (insert (rfc2047-decode-cte charset encoding word))))
> ! ;; Then decode the text encoding.
> ! (save-restriction
> ! (narrow-to-region e (point))
> ! (goto-char e)
> ! (while (not (eobp))
> ! (let ((from (point))
> ! (coding (get-text-property (point) 'coding)))
> ! (goto-char (next-single-property-change from 'coding nil
> ! (point-max)))
> ! (if coding
> ! (decode-coding-region from (point) coding)))))
> (save-restriction
> (narrow-to-region e (point))
> (goto-char e)
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-14 16:51 ` gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r Katsumi Yamaoka
@ 2005-10-15 0:46 ` Kenichi Handa
2005-10-15 8:28 ` Katsumi Yamaoka
0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-10-15 0:46 UTC (permalink / raw)
Cc: rms, bsam, ding
In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
>> As I don't know the detail of rfc2047, I have not yet
>> installed it. Could you please check the code and install
>> it (or a version that does the similar thing).
> Thank you for the patch, but I'm not sure whether dividing of
> encoded words in that way is rightful. I need time to look into
> it.
Thank you.
---
Kenichi Handa
handa@m17n.org
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-15 0:46 ` Kenichi Handa
@ 2005-10-15 8:28 ` Katsumi Yamaoka
2005-10-15 8:50 ` Kenichi Handa
0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-15 8:28 UTC (permalink / raw)
Cc: rms, bsam, ding
>>>>> In <E1EQaBp-0003ve-00@etlken> Kenichi Handa wrote:
> In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
>>> As I don't know the detail of rfc2047, I have not yet
>>> installed it. Could you please check the code and install
>>> it (or a version that does the similar thing).
>> Thank you for the patch, but I'm not sure whether dividing of
>> encoded words in that way is rightful. I need time to look into
>> it.
> Thank you.
I confirmed Handa-san's patch is 99% perfect and doesn't lower
the performance. However I hesitate to commit it to Gnus
because I found out the `MUST NOT' phrase in RFC2047 as follows:
    5. Use of encoded-words in message headers
    [...]
    The 'encoded-text' in an 'encoded-word' must be self-contained;
    'encoded-text' MUST NOT be continued from one 'encoded-word' to
    another. This implies that the 'encoded-text' portion of a "B"
    'encoded-word' will be a multiple of 4 characters long; for a "Q"
    'encoded-word', any "=" character that appears in the 'encoded-text'
    portion will be followed by two hexadecimal characters.
The encoded-words that Boris B. Samorodov presented comes just
under this case. Even so, should Gnus support such encodings?
>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:
> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-15 8:28 ` Katsumi Yamaoka
@ 2005-10-15 8:50 ` Kenichi Handa
2005-10-15 10:06 ` Katsumi Yamaoka
0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-10-15 8:50 UTC (permalink / raw)
Cc: rms, bsam, ding, handa
In article <b4mll0vfake.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
> I confirmed Handa-san's patch is 99% perfect and doesn't lower
> the performance. However I hesitate to commit it to Gnus
> because I found out the `MUST NOT' phrase in RFC2047 as follows:
> 5. Use of encoded-words in message headers
> [...]
> The 'encoded-text' in an 'encoded-word' must be self-contained;
> 'encoded-text' MUST NOT be continued from one 'encoded-word' to
> another. This implies that the 'encoded-text' portion of a "B"
> 'encoded-word' will be a multiple of 4 characters long; for a "Q"
> 'encoded-word', any "=" character that appears in the 'encoded-text'
> portion will be followed by two hexadecimal characters.
> The encoded-words that Boris B. Samorodov presented comes just
> under this case. Even so, should Gnus support such encodings?
>>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:
>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
This example doesn't violate the above restriction. Each
'encoded-word' is surely "multiple of 4 characters long".
Please note that the above restriction is for
'encoded-text', not for the underlying coded character set.
So, I think the above document doesn't prohibit dividing
a UTF-8 byte sequence at a non-character boundary.
---
Kenichi Handa
handa@m17n.org
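The "multiple of 4 characters" condition is easy to check mechanically. The sketch below is an illustration only: the `sketch-' function names and the simplified regular expression are assumptions made for this example, not the pattern used by rfc2047.el.

```elisp
(defun sketch-b-encoded-texts (header)
  "Return the `encoded-text' portions of the B encoded-words in HEADER.
The regexp is a simplification: charset, then B, then the text."
  (let ((start 0)
        texts)
    (while (string-match "=\\?[^?]+\\?[Bb]\\?\\([^?]*\\)\\?=" header start)
      (push (match-string 1 header) texts)
      (setq start (match-end 0)))
    (nreverse texts)))

(defun sketch-b-lengths-ok-p (header)
  "Return non-nil if every B `encoded-text' in HEADER has a length
that is a multiple of 4 characters."
  (let ((ok t))
    (dolist (text (sketch-b-encoded-texts header) ok)
      (unless (zerop (% (length text) 4))
        (setq ok nil)))))
```

For the reported Subject the two encoded-text portions are 60 and 16 characters long, so this check passes even though the UTF-8 character is split between them.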
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-15 8:50 ` Kenichi Handa
@ 2005-10-15 10:06 ` Katsumi Yamaoka
2005-10-16 0:25 ` Kenichi Handa
2005-10-18 18:20 ` Boris Samorodov
0 siblings, 2 replies; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-15 10:06 UTC (permalink / raw)
Cc: rms, bsam, ding
>>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:
>> 5. Use of encoded-words in message headers
>> [...]
>> The 'encoded-text' in an 'encoded-word' must be self-contained;
>> 'encoded-text' MUST NOT be continued from one 'encoded-word' to
>> another. This implies that the 'encoded-text' portion of a "B"
>> 'encoded-word' will be a multiple of 4 characters long; for a "Q"
>> 'encoded-word', any "=" character that appears in the 'encoded-text'
>> portion will be followed by two hexadecimal characters.
>> The encoded-words that Boris B. Samorodov presented comes just
>> under this case. Even so, should Gnus support such encodings?
>>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
> This example doesn't violate the above restriction. Each
> 'encoded-word' is surely "multiple of 4 characters long".
> Please note that the above restriction is for
> 'encoded-text', not for the underlying coded character set.
> So, I think the above document doesn't prohibit dividing
> a UTF-8 byte sequence at a non-character boundary.
I agree. Thank you for clarifying it. I've committed your
patch to cvs.gnus.org with small modifications. It will be
propagated to Emacs soon.
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-15 10:06 ` Katsumi Yamaoka
@ 2005-10-16 0:25 ` Kenichi Handa
2005-10-18 18:20 ` Boris Samorodov
1 sibling, 0 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-10-16 0:25 UTC (permalink / raw)
Cc: rms, bsam, ding
In article <b4md5m7p003.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:
[...]
> I agree. Thank you for clarifying it. I've committed your
> patch to cvs.gnus.org with small modifications. It will be
> propagated to Emacs soon.
Thank you.
---
Kenichi Handa
handa@m17n.org
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-15 10:06 ` Katsumi Yamaoka
2005-10-16 0:25 ` Kenichi Handa
@ 2005-10-18 18:20 ` Boris Samorodov
2005-10-19 4:12 ` Katsumi Yamaoka
1 sibling, 1 reply; 10+ messages in thread
From: Boris Samorodov @ 2005-10-18 18:20 UTC (permalink / raw)
Cc: Kenichi Handa, rms, ding
On Sat, 15 Oct 2005 19:06:52 +0900 Katsumi Yamaoka wrote:
> >>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:
> >> 5. Use of encoded-words in message headers
> >> [...]
> >> The 'encoded-text' in an 'encoded-word' must be self-contained;
> >> 'encoded-text' MUST NOT be continued from one 'encoded-word' to
> >> another. This implies that the 'encoded-text' portion of a "B"
> >> 'encoded-word' will be a multiple of 4 characters long; for a "Q"
> >> 'encoded-word', any "=" character that appears in the 'encoded-text'
> >> portion will be followed by two hexadecimal characters.
> >> The encoded-words that Boris B. Samorodov presented comes just
> >> under this case. Even so, should Gnus support such encodings?
> >>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
> > This example doesn't violate the above restriction. Each
> > 'encoded-word' is surely "multiple of 4 characters long".
> > Please note that the above restriction is for
> > 'encoded-text', not for the underlying coded character set.
> > So, I think the above document doesn't prohibit dividing
> > a UTF-8 byte sequence at a non-character boundary.
> I agree. Thank you for clarifying it. I've committed your
> patch to cvs.gnus.org with small modifications. It will be
> propagated to Emacs soon.
This is to confirm that the latest revision 7.43 from HEAD
for gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
fields.
Thank you all who helped to investigate and unbreak the case!
Should I confirm the success story anywhere else (maybe
bug-gnu-emacs@)?
WBR
--
bsam
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-18 18:20 ` Boris Samorodov
@ 2005-10-19 4:12 ` Katsumi Yamaoka
2005-10-19 20:16 ` Richard M. Stallman
0 siblings, 1 reply; 10+ messages in thread
From: Katsumi Yamaoka @ 2005-10-19 4:12 UTC (permalink / raw)
Cc: Kenichi Handa, rms, ding
>>>>> In <20421354@serv3.int.kfs.ru> Boris Samorodov wrote:
> This is to confirm that the latest revision 7.43 from HEAD
> for gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
> fields.
> Thank you all who helped to investigate and unbreak the case!
You're welcome. After discussing with well-informed people in
Japan, we came to recognize that such an encoding (dividing the
encoded text at a place that is not a character boundary) violates
RFC 2047, section 5. Here's an extract:
    Each 'encoded-word' MUST represent an integral number of characters.
    A multi-octet character may not be split across adjacent
    'encoded-word's.
That doesn't prohibit trying to decode them, though, and it would
be nice for Gnus to do so.
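Whether a single UTF-8 payload ends on a character boundary can be tested by looking at its last few bytes. The following is a rough sketch under stated assumptions: `sketch-utf8-complete-tail-p' is a hypothetical name, its argument is assumed to be a unibyte string, and it inspects only the tail rather than validating the whole sequence.

```elisp
(defun sketch-utf8-complete-tail-p (bytes)
  "Return non-nil if unibyte string BYTES does not end inside a
UTF-8 multi-byte sequence.  Only the tail is inspected."
  (let ((i (1- (length bytes)))
        (trail 0))
    ;; Walk back over continuation bytes (bit pattern 10xxxxxx).
    (while (and (>= i 0)
                (= (logand (aref bytes i) #xC0) #x80))
      (setq trail (1+ trail)
            i (1- i)))
    (if (< i 0)
        ;; Empty string, or nothing but continuation bytes.
        (zerop trail)
      (let* ((lead (aref bytes i))
             ;; Number of continuation bytes the lead byte asks for.
             (need (cond ((< lead #x80) 0)
                         ((= (logand lead #xE0) #xC0) 1)
                         ((= (logand lead #xF0) #xE0) 2)
                         ((= (logand lead #xF8) #xF0) 3)
                         (t 0))))
        (= trail need)))))
```

Applied to the first payload of the reported Subject, which ends with the lead byte #xD0, the function returns nil; a lenient decoder can use such a result to decide to join the bytes with the next encoded-word before decoding.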
> Should I confirm the success story anywhere else (maybe
> bug-gnu-emacs@)?
There's no necessity, maybe.
BTW, I realized that that fix was insufficient. For instance,
it will display binary garbage if the charset specified in the
encoded-word is unknown. So, I will CVS commit the new code
after a while.
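One possible way to avoid showing binary garbage is to leave the encoded-word untouched when its charset cannot be mapped to a coding system. The sketch below only illustrates that idea and is not the code that was later committed: `sketch-decode-b-word' is a hypothetical helper, and treating the lowercased charset name as a coding-system symbol is a simplifying assumption (Gnus uses `mm-charset-to-coding-system' for this mapping).

```elisp
(defun sketch-decode-b-word (charset text original)
  "Decode Base64 TEXT as CHARSET and return the decoded string.
If CHARSET does not name a known coding system, return ORIGINAL,
the undecoded encoded-word, instead of raw bytes."
  (let ((coding (intern (downcase charset))))
    (if (coding-system-p coding)
        (decode-coding-string (base64-decode-string text) coding)
      original)))
```

For example, `(sketch-decode-b-word "UTF-8" "SGVsbG8=" "=?UTF-8?B?SGVsbG8=?=")` decodes to "Hello", whereas a made-up charset name returns the third argument unchanged.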
Regards,
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
2005-10-19 4:12 ` Katsumi Yamaoka
@ 2005-10-19 20:16 ` Richard M. Stallman
0 siblings, 0 replies; 10+ messages in thread
From: Richard M. Stallman @ 2005-10-19 20:16 UTC (permalink / raw)
Cc: bsam, handa, ding
    Each 'encoded-word' MUST represent an integral number of characters.
    A multi-octet character may not be split across adjacent
    'encoded-word's.
    That doesn't prohibit trying to decode them, though, and it would
    be nice for Gnus to do so.
I agree. Maybe mailers should not generate this,
but if they do, it is better for Gnus to handle it right.
    BTW, I realized that that fix was insufficient. For instance,
    it will display binary garbage if the charset specified in the
    encoded-word is unknown. So, I will CVS commit the new code
    after a while.
Thank you in advance.
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
[not found] <E1EQ6NU-000GJF-2s@bsam.ru>
@ 2005-10-13 18:26 ` Reiner Steib
0 siblings, 0 replies; 10+ messages in thread
From: Reiner Steib @ 2005-10-13 18:26 UTC (permalink / raw)
Cc: emacs-pretest-bug, Ding List
On Thu, Oct 13 2005, Boris B. Samorodov wrote:
[ On emacs-pretest. Cc-ing Ding ]
> Symptoms:
>
> I do have a letter with the next Subject:
> -----
> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
> -----
>
> In command-line mode I can do...
>
> $ echo "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQnyDRgtC10YHRgg==" | base64 -d | iconv -f utf-8
>
> ...and receive the answer:
>
> [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> But gnus (from cvs as emacs) shows the next line...
>
> Subject: [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> ...which is wrong.
I don't see any difference. Maybe I'm misunderstanding what you mean.
> The bug appeared to be at illegal concatenation of
> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.
Whitespace between adjacent encoded words has to be ignored according
to RFC 2047:
,----[ rfc2047.txt ]
| (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=) (ab)
|
| White space between adjacent 'encoded-word's is not
| displayed.
|
| (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=) (ab)
|
| Even multiple SPACEs between 'encoded-word's are ignored
| for the purpose of display.
|
| (=?ISO-8859-1?Q?a?= (ab)
| =?ISO-8859-1?Q?b?=)
|
| Any amount of linear-space-white between 'encoded-word's,
| even if it includes a CRLF followed by one or more SPACEs,
| is ignored for the purposes of display.
`----
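The rule quoted above can be sketched as a string operation. This is a minimal illustration, assuming that a "?=" followed by white space and then "=?" always marks the gap between two encoded-words; `sketch-join-encoded-words' is a hypothetical name, not a function from rfc2047.el.

```elisp
(defun sketch-join-encoded-words (header)
  "Delete linear white space between adjacent encoded-words in HEADER.
White space next to ordinary text is left alone."
  (replace-regexp-in-string
   ;; End of one encoded-word, white space (possibly with CRLF),
   ;; start of the next encoded-word.
   "\\(\\?=\\)[ \t\r\n]+\\(=\\?\\)"
   "\\1\\2"
   header
   t))
```

After this step the encoded-words sit directly next to each other, which is the state step (1) in the description of the decoder refers to.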
Bye, Reiner.
--
,,,
(o o)
---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/
Thread overview: 10+ messages
[not found] <E1EQHq4-0002rQ-EC@fencepost.gnu.org>
[not found] ` <E1EQPSa-0006iC-00@etlken>
2005-10-14 16:51 ` gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r Katsumi Yamaoka
2005-10-15 0:46 ` Kenichi Handa
2005-10-15 8:28 ` Katsumi Yamaoka
2005-10-15 8:50 ` Kenichi Handa
2005-10-15 10:06 ` Katsumi Yamaoka
2005-10-16 0:25 ` Kenichi Handa
2005-10-18 18:20 ` Boris Samorodov
2005-10-19 4:12 ` Katsumi Yamaoka
2005-10-19 20:16 ` Richard M. Stallman
[not found] <E1EQ6NU-000GJF-2s@bsam.ru>
2005-10-13 18:26 ` Reiner Steib