* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Katsumi Yamaoka @ 2005-10-14 16:51 UTC
Cc: rms, bsam, ding

I added the ding list to Cc:.

>>>>> In <E1EQPSa-0006iC-00@etlken> Handa-san wrote:

> Ok.  Yamaoka-san, it seems that you are the last modifier of
> the relevant part of rfc2047.el, so I include you in CC:.

I'd improved only the rfc2047.el encoder, but that's ok.

>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:

>> The bug appeared to be at illegal concatenation of
>> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.

> Yes, the bug is in the way of handling those parts.
> The current code does this:

> (1) Remove spaces between encoded words.
> (2) Decode content-transfer-encoding of <foo> and decode the
>     resulting text by utf-8, then decode
>     content-transfer-encoding of <bar> and decode the resulting
>     text by utf-8.

> But it doesn't work if <foo> and <bar> are divided not at
> a character boundary of utf-8.  The above case is this.

I see.  The sample that Boris B. Samorodov first brought up on the
pretest-bug list gives the following result:

(prin1 (rfc2047-decode-string "=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?="))
"[ipt.ru #163] АвтоОтвет: МСК: С\xc3\x90\xc2\x9f тест"

And FLIM's decoder does the same.

> So what we should do is:

> (2') Decode content-transfer-encoding of <foo> and <bar>
>      while keeping information of coding system (utf-8 in this
>      case) on each part.  Then decode the text encoding of a run
>      that has the same coding system at once.

> I'll attach a sample patch for doing that.  I modified
> rfc2047-decode-string with a helper function
> rfc2047-decode-cte.

> As I don't know the detail of rfc2047, I have not yet
> installed it.  Could you please check the code and install
> it (or a version that does a similar thing).

Thank you for the patch, but I'm not sure whether dividing
encoded words in that way is legitimate.  I need time to look into
it.

> *** rfc2047.el 08 Aug 2005 10:13:38 +0900 1.22
> --- rfc2047.el 14 Oct 2005 22:16:03 +0900
> ***************
> *** 822,827 ****
> --- 822,843 ----
>   ;; and worthwhile (is it more correct or not?), e.g. something like
>   ;; `=?iso-8859-1?q?foo?=@'.
>
> + (defun rfc2047-decode-cte (charset encoding word)
> +   "Decode content-transfer-encoding of WORD by ENCODING.
> + Put text property `coding' to the decoded word with value a coding system
> + derived from CHARSET."
> +   (cond ((char-equal ?B encoding)
> +          (setq word (base64-decode-string (rfc2047-pad-base64 word))))
> +         ((char-equal ?Q encoding)
> +          (setq word (quoted-printable-decode-string
> +                      (mm-subst-char-in-string ?_ ?  word t))))
> +         (t (error "Invalid encoding: %c" encoding)))
> +   (setq word (string-to-multibyte word))
> +   (setq charset (intern (downcase charset)))
> +   (put-text-property 0 (length word)
> +                      'coding (mm-charset-to-coding-system charset) word)
> +   word)
> +
>   (defun rfc2047-decode-region (start end)
>     "Decode MIME-encoded words in region between START and END."
>     (interactive "r")
> ***************
> *** 842,857 ****
>           ;; Decode the encoded words.
>           (setq b (goto-char (point-min)))
>           (while (re-search-forward rfc2047-encoded-word-regexp nil t)
>             (setq e (match-beginning 0))
> !           (insert (rfc2047-parse-and-decode
> !                    (prog1
> !                        (match-string 0)
> !                      (delete-region e (match-end 0)))))
> !           (while (looking-at rfc2047-encoded-word-regexp)
> !             (insert (rfc2047-parse-and-decode
> !                      (prog1
> !                          (match-string 0)
> !                        (delete-region (point) (match-end 0))))))
>             (save-restriction
>               (narrow-to-region e (point))
>               (goto-char e)
> --- 858,888 ----
>           ;; Decode the encoded words.
>           (setq b (goto-char (point-min)))
>           (while (re-search-forward rfc2047-encoded-word-regexp nil t)
> +           ;; At first, decode content-transfer-encoding of the
> +           ;; succeeding encoded words.
>             (setq e (match-beginning 0))
> !           (let ((charset (match-string 1))
> !                 (encoding (char-after (match-beginning 3)))
> !                 (word (match-string 4)))
> !             (delete-region e (match-end 0))
> !             (insert (rfc2047-decode-cte charset encoding word))
> !             (while (looking-at rfc2047-encoded-word-regexp)
> !               (setq charset (match-string 1)
> !                     encoding (char-after (match-beginning 3))
> !                     word (match-string 4))
> !               (delete-region (point) (match-end 0))
> !               (insert (rfc2047-decode-cte charset encoding word))))
> !           ;; Then decode the text encoding.
> !           (save-restriction
> !             (narrow-to-region e (point))
> !             (goto-char e)
> !             (while (not (eobp))
> !               (let ((from (point))
> !                     (coding (get-text-property (point) 'coding)))
> !                 (goto-char (next-single-property-change from coding nil
> !                                                         (point-max)))
> !                 (if coding
> !                     (decode-coding-region from (point) coding)))))
>             (save-restriction
>               (narrow-to-region e (point))
>               (goto-char e)

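The (2') approach described in the message above can be illustrated outside Emacs. What follows is only a minimal Python sketch, not the rfc2047.el code: the names ENCODED_WORD, decode_cte, decode_word_by_word and decode_adjacent_words are made up for this illustration, and the sketch assumes the header consists of nothing but adjacent encoded-words. It undoes the B/Q transfer encoding of each word first, joins the raw bytes of consecutive words that share a charset, and only then decodes the charset.

```python
import base64
import quopri
import re

# Hypothetical pattern for one RFC 2047 encoded-word:
#   =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

SUBJECT = ("=?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?="
           " =?UTF-8?B?nyDRgtC10YHRgg==?=")


def decode_cte(encoding, text):
    """Undo only the B or Q transfer encoding and return raw bytes."""
    if encoding.upper() == "B":
        # Pad to a multiple of 4 in case the sender dropped the "=" padding.
        return base64.b64decode(text + "=" * (-len(text) % 4))
    # Q encoding: header=True makes "_" decode to a space.
    return quopri.decodestring(text.encode("ascii"), header=True)


def decode_word_by_word(header):
    """Old behaviour: charset-decode every encoded-word on its own."""
    return "".join(
        decode_cte(enc, text).decode(charset, errors="replace")
        for charset, enc, text in ENCODED_WORD.findall(header)
    )


def decode_adjacent_words(header):
    """(2') behaviour: join bytes of same-charset runs, then decode once."""
    runs = []  # each entry is [charset, bytearray of raw bytes]
    for charset, enc, text in ENCODED_WORD.findall(header):
        raw = decode_cte(enc, text)
        if runs and runs[-1][0] == charset.lower():
            runs[-1][1].extend(raw)
        else:
            runs.append([charset.lower(), bytearray(raw)])
    return "".join(
        bytes(data).decode(charset, errors="replace") for charset, data in runs
    )


if __name__ == "__main__":
    print(decode_word_by_word(SUBJECT))    # contains replacement characters
    print(decode_adjacent_words(SUBJECT))  # the intended Subject text
```

With Boris's sample, the word-by-word variant produces replacement characters where the Cyrillic letter was cut in two, while the run-based variant recovers the whole Subject.
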
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Kenichi Handa @ 2005-10-15 0:46 UTC
Cc: rms, bsam, ding

In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> As I don't know the detail of rfc2047, I have not yet
>> installed it.  Could you please check the code and install
>> it (or a version that does a similar thing).

> Thank you for the patch, but I'm not sure whether dividing
> encoded words in that way is legitimate.  I need time to look into
> it.

Thank you.

---
Kenichi Handa
handa@m17n.org

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Katsumi Yamaoka @ 2005-10-15 8:28 UTC
Cc: rms, bsam, ding

>>>>> In <E1EQaBp-0003ve-00@etlken> Kenichi Handa wrote:

> In article <b4my84wf3ez.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

>>> As I don't know the detail of rfc2047, I have not yet
>>> installed it.  Could you please check the code and install
>>> it (or a version that does a similar thing).

>> Thank you for the patch, but I'm not sure whether dividing
>> encoded words in that way is legitimate.  I need time to look into
>> it.

> Thank you.

I confirmed that Handa-san's patch is 99% perfect and doesn't lower
the performance.  However, I hesitate to commit it to Gnus
because I found the `MUST NOT' phrase in RFC2047, as follows:

5. Use of encoded-words in message headers
[...]
   The 'encoded-text' in an 'encoded-word' must be self-contained;
   'encoded-text' MUST NOT be continued from one 'encoded-word' to
   another.  This implies that the 'encoded-text' portion of a "B"
   'encoded-word' will be a multiple of 4 characters long; for a "Q"
   'encoded-word', any "=" character that appears in the 'encoded-text'
   portion will be followed by two hexadecimal characters.

The encoded-words that Boris B. Samorodov presented come under just
this case.  Even so, should Gnus support such encodings?

>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:

> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Kenichi Handa @ 2005-10-15 8:50 UTC
Cc: rms, bsam, ding, handa

In article <b4mll0vfake.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

> I confirmed that Handa-san's patch is 99% perfect and doesn't lower
> the performance.  However, I hesitate to commit it to Gnus
> because I found the `MUST NOT' phrase in RFC2047, as follows:

> 5. Use of encoded-words in message headers
> [...]
>    The 'encoded-text' in an 'encoded-word' must be self-contained;
>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
>    another.  This implies that the 'encoded-text' portion of a "B"
>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
>    'encoded-word', any "=" character that appears in the 'encoded-text'
>    portion will be followed by two hexadecimal characters.

> The encoded-words that Boris B. Samorodov presented come under just
> this case.  Even so, should Gnus support such encodings?

>>>>>> In <E1EQ6NU-000GJF-2s@bsam.ru>
>>>>>> "Boris B. Samorodov" <bsam@ipt.ru> wrote:

>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

This example doesn't violate the above restriction.  The
'encoded-text' of each 'encoded-word' is surely a "multiple of 4
characters long".  Please note that the above restriction is for
'encoded-text', not for the underlying coded character set.
So, I think the above document doesn't prohibit dividing a
UTF-8 byte sequence at a non-character boundary.

---
Kenichi Handa
handa@m17n.org

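The point made in the message above can be checked by hand. The following Python sketch (the constant and function names are invented for the illustration) takes the two 'encoded-text' portions from Boris's Subject, confirms that each is a multiple of 4 characters long as the quoted "B" rule requires, and then looks at the raw bytes on either side of the boundary to see that a UTF-8 character is nevertheless cut in two.

```python
import base64

# The two 'encoded-text' portions from the Subject header in this thread.
FIRST = "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ"
SECOND = "nyDRgtC10YHRgg=="


def is_self_contained_b(text):
    """The quoted rule for "B": length must be a multiple of 4 characters."""
    return len(text) % 4 == 0


def starts_with_continuation_octet(data):
    """True if DATA begins with a UTF-8 continuation octet (10xxxxxx)."""
    return len(data) > 0 and (data[0] & 0xC0) == 0x80


first_bytes = base64.b64decode(FIRST)
second_bytes = base64.b64decode(SECOND)

if __name__ == "__main__":
    print(is_self_contained_b(FIRST), is_self_contained_b(SECOND))
    # Last octet of the first word and first octet of the second word:
    print(hex(first_bytes[-1]), hex(second_bytes[0]))
    print(starts_with_continuation_octet(second_bytes))
    # Joined, the two octets form one Cyrillic letter.
    print((first_bytes[-1:] + second_bytes[:1]).decode("utf-8"))
```

So both words satisfy the base64 length rule, yet the second one starts in the middle of a character: the restriction on 'encoded-text' and the integrity of characters in the underlying charset are separate questions.
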
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Katsumi Yamaoka @ 2005-10-15 10:06 UTC
Cc: rms, bsam, ding

>>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:

>> 5. Use of encoded-words in message headers
>> [...]
>>    The 'encoded-text' in an 'encoded-word' must be self-contained;
>>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
>>    another.  This implies that the 'encoded-text' portion of a "B"
>>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
>>    'encoded-word', any "=" character that appears in the 'encoded-text'
>>    portion will be followed by two hexadecimal characters.

>> The encoded-words that Boris B. Samorodov presented come under just
>> this case.  Even so, should Gnus support such encodings?

>>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

> This example doesn't violate the above restriction.  The
> 'encoded-text' of each 'encoded-word' is surely a "multiple of 4
> characters long".  Please note that the above restriction is for
> 'encoded-text', not for the underlying coded character set.
> So, I think the above document doesn't prohibit dividing a
> UTF-8 byte sequence at a non-character boundary.

I agree.  Thank you for clarifying it.  I've committed your
patch to cvs.gnus.org with small modifications.  It will be
propagated to Emacs soon.

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Kenichi Handa @ 2005-10-16 0:25 UTC
Cc: rms, bsam, ding

In article <b4md5m7p003.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

[...]
> I agree.  Thank you for clarifying it.  I've committed your
> patch to cvs.gnus.org with small modifications.  It will be
> propagated to Emacs soon.

Thank you.

---
Kenichi Handa
handa@m17n.org

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Boris Samorodov @ 2005-10-18 18:20 UTC
Cc: Kenichi Handa, rms, ding

On Sat, 15 Oct 2005 19:06:52 +0900 Katsumi Yamaoka wrote:
> >>>>> In <E1EQhkN-0001aF-00@etlken> Handa-san wrote:

> >> 5. Use of encoded-words in message headers
> >> [...]
> >>    The 'encoded-text' in an 'encoded-word' must be self-contained;
> >>    'encoded-text' MUST NOT be continued from one 'encoded-word' to
> >>    another.  This implies that the 'encoded-text' portion of a "B"
> >>    'encoded-word' will be a multiple of 4 characters long; for a "Q"
> >>    'encoded-word', any "=" character that appears in the 'encoded-text'
> >>    portion will be followed by two hexadecimal characters.

> >> The encoded-words that Boris B. Samorodov presented come under just
> >> this case.  Even so, should Gnus support such encodings?

> >>> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=

> > This example doesn't violate the above restriction.  The
> > 'encoded-text' of each 'encoded-word' is surely a "multiple of 4
> > characters long".  Please note that the above restriction is for
> > 'encoded-text', not for the underlying coded character set.
> > So, I think the above document doesn't prohibit dividing a
> > UTF-8 byte sequence at a non-character boundary.

> I agree.  Thank you for clarifying it.  I've committed your
> patch to cvs.gnus.org with small modifications.  It will be
> propagated to Emacs soon.

This is to confirm that the latest revision, 7.43 from HEAD,
of gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
fields.

Thanks to all who helped to investigate and fix the case!

Should I confirm the success story anywhere else (maybe
bug-gnu-emacs@)?


WBR
--
bsam

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Katsumi Yamaoka @ 2005-10-19 4:12 UTC
Cc: Kenichi Handa, rms, ding

>>>>> In <20421354@serv3.int.kfs.ru> Boris Samorodov wrote:

> This is to confirm that the latest revision, 7.43 from HEAD,
> of gnus/lisp/rfc2047.el from gnus cvs is fine with Subject and From
> fields.

> Thanks to all who helped to investigate and fix the case!

You're welcome.  After discussing with well-informed people in
Japan, we came to recognize that such an encoding (dividing encoded
text at places that are not character boundaries) violates RFC2047,
section 5.  Here's an extract:

   Each 'encoded-word' MUST represent an integral number of characters.
   A multi-octet character may not be split across adjacent
   'encoded-word's.

That doesn't prohibit trying to decode them, though, and it would
be nice for Gnus to do so.

> Should I confirm the success story anywhere else (maybe
> bug-gnu-emacs@)?

That's probably not necessary.

BTW, I realized that that fix was insufficient.  For instance,
it will display binary garbage if the charset specified in the
encoded-word is unknown.  So, I will commit the new code to CVS
after a while.

Regards,

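The unknown-charset problem mentioned at the end of the message above can be sketched as follows. This is only an illustration in Python, under the assumption that leaving the encoded-word untouched is an acceptable fallback; the thread does not say what the later commit actually does, and the function name decode_b_word is hypothetical.

```python
import base64
import codecs


def decode_b_word(charset, text):
    """Decode one "B" encoded-word.

    If the charset is not known to this Python installation, return the
    encoded-word verbatim instead of showing undecoded bytes (an assumed
    fallback, not necessarily what Gnus does).
    """
    try:
        codecs.lookup(charset)
    except LookupError:
        return "=?%s?B?%s?=" % (charset, text)
    raw = base64.b64decode(text)
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    print(decode_b_word("utf-8", "aGVsbG8="))
    print(decode_b_word("x-unknown-charset", "aGVsbG8="))
```

Checking the charset before undoing the transfer encoding means the reader sees either decoded text or the original, still recognizable, encoded-word.
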
* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Richard M. Stallman @ 2005-10-19 20:16 UTC
Cc: bsam, handa, ding

       Each 'encoded-word' MUST represent an integral number of characters.
       A multi-octet character may not be split across adjacent
       'encoded-word's.

    That doesn't prohibit trying to decode them, though, and it would
    be nice for Gnus to do so.

I agree.  Maybe mailers should not generate this, but if they do, it
is better for Gnus to handle it right.

    BTW, I realized that that fix was insufficient.  For instance,
    it will display binary garbage if the charset specified in the
    encoded-word is unknown.  So, I will commit the new code to CVS
    after a while.

Thank you in advance.

* Re: gnus: incorrect conversion of Subject and From field from utf-8 to koi8-r
From: Reiner Steib @ 2005-10-13 18:26 UTC
Cc: emacs-pretest-bug, Ding List

On Thu, Oct 13 2005, Boris B. Samorodov wrote:

[ On emacs-pretest.  Cc-ing Ding ]

> Symptoms:
>
> I have a message with the following Subject:
> -----
> Subject: =?UTF-8?B?W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQ?= =?UTF-8?B?nyDRgtC10YHRgg==?=
> -----
>
> On the command line I can do...
>
> $ echo "W2lwdC5ydSAjMTYzXSDQkNCy0YLQvtCe0YLQstC10YI6INCc0KHQmjog0KHQnyDRgtC10YHRgg==" | base64 -d | iconv -f utf-8
>
> ...and receive the answer:
>
> [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> But gnus (from cvs as emacs) shows the following line...
>
> Subject: [ipt.ru #163] АвтоОтвет: МСК: СП тест
>
> ...which is wrong.

I don't see any difference.  Maybe I'm misunderstanding what you mean.

> The bug appeared to be at illegal concatenation of
> =?UTF-8?<foo> =?UTF-8?<bar> parts of the Subject.

Whitespace between adjacent encoded words has to be ignored according
to RFC 2047:

,----[ rfc2047.txt ]
|    (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)     (ab)
|
|         White space between adjacent 'encoded-word's is not
|         displayed.
|
|    (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)    (ab)
|
|         Even multiple SPACEs between 'encoded-word's are ignored
|         for the purpose of display.
|
|    (=?ISO-8859-1?Q?a?=                         (ab)
|        =?ISO-8859-1?Q?b?=)
|
|         Any amount of linear-space-white between 'encoded-word's,
|         even if it includes a CRLF followed by one or more SPACEs,
|         is ignored for the purposes of display.
`----

Bye, Reiner.
--
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

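The whitespace rule quoted from RFC 2047 in the message above can be tried out with a small Python sketch. The pattern and the function name drop_space_between_encoded_words are assumptions made for this illustration; the sketch only removes linear white space that sits between two encoded-words and leaves every other space alone.

```python
import re

# Hypothetical pattern for one encoded-word: =?charset?encoding?text?=
EW = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="

# An encoded-word, then white space, then (as a lookahead) another
# encoded-word.  The lookahead keeps the second word available for the
# next match, so runs of three or more words are handled too.
GAP = re.compile(r"(%s)\s+(?=%s)" % (EW, EW))


def drop_space_between_encoded_words(header):
    """Remove white space between adjacent encoded-words only."""
    return GAP.sub(r"\1", header)


if __name__ == "__main__":
    for sample in (
        "=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=",
        "=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=",
        "=?ISO-8859-1?Q?a?=\r\n    =?ISO-8859-1?Q?b?=",
        "=?ISO-8859-1?Q?a?= b",
    ):
        print(repr(drop_space_between_encoded_words(sample)))
```

The first three samples, taken from the quoted RFC text, all collapse to the same pair of adjacent words, while the space before the plain "b" in the last one is kept.
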