* Re: imap breaks latin-1 characters
  [not found] ` <iluvgvtn673.fsf@barbar.josefsson.org>
@ 2000-09-18 15:17 ` ShengHuo ZHU
  2000-09-18 19:10   ` Simon Josefsson
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 15:17 UTC
Cc: ding

Simon Josefsson <simon@josefsson.org> writes:

> > Simon, the following code may illustrate the bug.
> >
> > (let (s (coding-system-for-write 'binary))
> >   (mm-with-unibyte-buffer
> >     (insert "0123" 223 "ABCD")
> >     (setq s (buffer-string)))
> >   (with-temp-file "test"
> >     ;; (mm-disable-multibyte)
> >     (insert s)))
> >
> > String `s' is unibyte. When it is inserted to multibyte "test" buffer,
> > Emacs convert s into multibyte. The ugly thing is that all \240-\377
> > chars are converted into latin-iso8859-1, where those \201's come
> > from.
>
> Ouch.
>
> Perhaps (insert (string-as-multibyte (concat "0123" 223 "ABCD")))
> could be used instead, so we don't have to disable multibyte in the
> buffer. IMHO switching multibyte on and off in various parts of
> imap/nnimap/mail-source/Gnus is causing part of the headache in the
> first place. But it's probably more consistent with how things work so
> there's probably no point in changing it.

I thought so, but I change my mind after recently I read some MULE
code of Emacs. IMHO, switching multibyte on and off causes the things
complicated and is not the `right thing' to do. That results
inserting unibyte strings into multibyte buffers, converting \240-\377
into latin-iso8859-1, then doubly decoding.

The behavior of unibyte-char-to-multibyte may also hide some bugs.
For example, some NNTP servers may have some groups with
latin-iso8859-1 chars in their names. The names are simply insert the
group buffer without decoding. Those latin chars show. It seems
working, but not for other charset.

Now I suggest removing all unibyte-buffer in Gnus (maybe in oGnus).
All binary data would exist as multibyte strings, though it may not
work with early Emacs 20 MULE bugs (why should we bother?).

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-18 15:17 ` imap breaks latin-1 characters ShengHuo ZHU
@ 2000-09-18 19:10 ` Simon Josefsson
  2000-09-18 19:58   ` ShengHuo ZHU
  0 siblings, 1 reply; 18+ messages in thread
From: Simon Josefsson @ 2000-09-18 19:10 UTC
Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> Simon Josefsson <simon@josefsson.org> writes:
>
> > Perhaps (insert (string-as-multibyte (concat "0123" 223 "ABCD")))
> > could be used instead, so we don't have to disable multibyte in the
> > buffer. IMHO switching multibyte on and off in various parts of
> > imap/nnimap/mail-source/Gnus is causing part of the headache in the
> > first place. But it's probably more consistent with how things work so
> > there's probably no point in changing it.
>
> I thought so, but I change my mind after recently I read some MULE
> code of Emacs. IMHO, switching multibyte on and off causes the things
> complicated and is not the `right thing' to do.

I agree fully.

> The behavior of unibyte-char-to-multibyte may also hide some bugs.
> For example, some NNTP servers may have some groups with
> latin-iso8859-1 chars in their names. The names are simply insert the
> group buffer without decoding. Those latin chars show. It seems
> working, but not for other charset.

This can never work properly, so I'm not sure we should care. NNTP
doesn't support character set tagging. All we can provide is
intelligent defaults and a possibility of telling Gnus what charset
group names are in.

> Now I suggest removing all unibyte-buffer in Gnus (maybe in oGnus).

oGnus please, imho.

A silly question: how do you convert a binary raw multibyte string
into a multibyte string of a named charset? Is there a API to do
this? Do you need to go over the string character by character and
(make-char CHARSET c)?

I'm concerned with efficiency. Large attachments really kill Gnus
performance today. Reducing number of insert's, buffer-substring's,
encoding, decoding and stuff would be nice.

(Of course, attachments should ideally not be fetched at all unless
requested, but that's another story.)

> All binary data would exist as multibyte strings, though it may not
> work with early Emacs 20 MULE bugs (why should we bother?).

We shouldn't bother.

* Re: imap breaks latin-1 characters
  2000-09-18 19:10 ` Simon Josefsson
@ 2000-09-18 19:58 ` ShengHuo ZHU
  2000-09-18 20:31   ` Simon Josefsson
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 19:58 UTC

Simon Josefsson <simon@josefsson.org> writes:

> oGnus please, imho.

I think so. I don't want to make any trouble in the beta version.

> A silly question: how do you convert a binary raw multibyte string
> into a multibyte string of a named charset? Is there a API to do
> this? Do you need to go over the string character by character and
> (make-char CHARSET c)?

decode-coding-string is the answer.

Is it so simple? You probably think about those "\201" bugs. Let's see
examples,

(with-temp-buffer
  (insert "1234\337ABCD")
  (decode-coding-region (point-min) (point-max) 'iso-8859-1)
  (buffer-string))

(with-temp-buffer
  (insert (string-as-multibyte "1234\337ABCD"))
  (decode-coding-region (point-min) (point-max) 'iso-8859-1)
  (buffer-string))

In first case, it returns a string contains "\201", because "\337" of
the unibyte string "1234\337ABCD" is converted into a latin-iso8859-1
character instead of the binary one. Therefore, doubly decoded.

In the second case, string-as-multibyte did the trick, but it is not a
total solution, especially for those cases with \200-\237 in the
string. For example, (string-as-multibyte "1234\201\337ABCD") returns
a string with one latin-iso8859-1 character instead of two 8-bit
characters.

Probably, the question is how to convert a binary raw unibyte string
into a binary raw multibyte string (instead of latin-iso8859-1). I
don't know any efficient solution. Therefore I suggest to avoid all
unibyte buffers and non-ascii unibyte strings.

> I'm concerned with efficiency. Large attachments really kill Gnus
> performance today. Reducing number of insert's, buffer-substring's,
> encoding, decoding and stuff would be nice.

Right, too many temporary buffers are involved in mm-get-part and
mm-insert-part.

> (Of course, attachments should ideally not be fetched at all unless
> requested, but that's another story.)

It sounds like message/external. Would adding a fetch callback
function into the handle structure help?

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-18 19:58 ` ShengHuo ZHU
@ 2000-09-18 20:31 ` Simon Josefsson
  2000-09-18 22:34   ` ShengHuo ZHU
  2000-09-18 22:42   ` Kai Großjohann
  0 siblings, 2 replies; 18+ messages in thread
From: Simon Josefsson @ 2000-09-18 20:31 UTC
Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> decode-coding-string is the answer.

Fine, thanks.

> In first case, it returns a string contains "\201", because "\337" of
> the unibyte string "1234\337ABCD" is converted into a latin-iso8859-1
> character instead of the binary one. Therefore, doubly decoded.
>
> In the second case, string-as-multibyte did the trick, but it is not a
> total solution, especially for those cases with \200-\237 in the
> string. For example, (string-as-multibyte "1234\201\337ABCD") returns
> a string with one latin-iso8859-1 character instead of two 8-bit
> characters.

Ok, I see the problems. Will this ever be fixed in Emacs 20.x, or is
the answer to wait for Emacs 21 here? Replacing unibyte buffers with
string-as-multibyte's is of no use if string-as-multibyte is buggy.

More problems: Press C-x C-e in the article buffer to evaluate your
examples. I get two \201's in the echo area from the first example
and one \201 from the second. Why?! The echo area _is_ a multibyte
buffer, isn't it?

I've seen \201's in the echo area before (BBDB) but never elsewhere,
this might be the same issue. I've no idea how to debug this.

> Probably, the question is how to convert a binary raw unibyte string
> into a binary raw multibyte string (instead of latin-iso8859-1). I
> don't know any efficient solution. Therefore I suggest to avoid all
> unibyte buffers and non-ascii unibyte strings.

Yes, I agree with this.

> > (Of course, attachments should ideally not be fetched at all unless
> > requested, but that's another story.)
>
> It sounds like message/external. Would adding a fetch callback
> function into the handle structure help?

I think a new backend interface is needed here. The mm-* function
could be modified to work with a MIME structure instead of the raw
mail, and whatever body parts are displayed invoke a fetch callback
that fetch that part. More oGnus stuff.

* Re: imap breaks latin-1 characters
  2000-09-18 20:31 ` Simon Josefsson
@ 2000-09-18 22:34 ` ShengHuo ZHU
  2000-09-19  9:27   ` Simon Josefsson
  2000-09-21 23:10   ` Dave Love
  2000-09-18 22:42 ` Kai Großjohann
  1 sibling, 2 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 22:34 UTC

Simon Josefsson <simon@josefsson.org> writes:

> > In the second case, string-as-multibyte did the trick, but it is not a
> > total solution, especially for those cases with \200-\237 in the
> > string. For example, (string-as-multibyte "1234\201\337ABCD") returns
> > a string with one latin-iso8859-1 character instead of two 8-bit
> > characters.
>
> Ok, I see the problems. Will this ever be fixed in Emacs 20.x, or is
> the answer to wait for Emacs 21 here? Replacing unibyte buffers with
> string-as-multibyte's is of no use if string-as-multibyte is buggy.

It is probably a feature instead of a bug. But the document "contains
an individual 8-bit byte (i.e. not part of multibyte form)" is
confusing.

> More problems: Press C-x C-e in the article buffer to evaluate your
> examples. I get two \201's in the echo area from the first example
> and one \201 from the second. Why?! The echo area _is_ a multibyte
> buffer, isn't it?

> I've seen \201's in the echo area before (BBDB) but never elsewhere,
> this might be the same issue. I've no idea how to debug this.

Wait a second. I see `1234\201ßABCD' from the first case and
`1234ßABCD' from the second in both Emacs 20.7 and 21.0.90. Did you
see EXACTLY?

\201 could show in a multibyte buffer. For example, (insert 129).

> I think a new backend interface is needed here. The mm-* function
> could be modified to work with a MIME structure instead of the raw
> mail, and whatever body parts are displayed invoke a fetch callback
> that fetch that part.

Sounds interesting.

-- 
ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-18 22:34 ` ShengHuo ZHU
@ 2000-09-19  9:27 ` Simon Josefsson
  2000-09-19 12:43   ` ShengHuo ZHU
  2000-09-21 23:10 ` Dave Love
  1 sibling, 1 reply; 18+ messages in thread
From: Simon Josefsson @ 2000-09-19 9:27 UTC
Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> > > In the second case, string-as-multibyte did the trick, but it is not a
> > > total solution, especially for those cases with \200-\237 in the
> > > string. For example, (string-as-multibyte "1234\201\337ABCD") returns
> > > a string with one latin-iso8859-1 character instead of two 8-bit
> > > characters.
> >
> > Ok, I see the problems. Will this ever be fixed in Emacs 20.x, or is
> > the answer to wait for Emacs 21 here? Replacing unibyte buffers with
> > string-as-multibyte's is of no use if string-as-multibyte is buggy.
>
> It is probably a feature instead of a bug. But the document "contains
> an individual 8-bit byte (i.e. not part of multibyte form)" is
> confusing.

But won't this cause problems for us? If we replace unibyte buffers
with string-as-multibyte where necessary (which I agree with), we'll
mutilate mail that contain \200-\237 character?

Since both (string-as-multibyte "1234\337ABCD") and
(string-as-multibyte "1234\201\337ABCD") look the same, I think we're
in trouble.

Or is there a (string-as-multibyte-foo "1234\201\337ABCD") that return
a multibyte string that will display as 1234\201ßABCD? Isn't that
what we need?

> > More problems: Press C-x C-e in the article buffer to evaluate your
> > examples. I get two \201's in the echo area from the first example
> > and one \201 from the second. Why?! The echo area _is_ a multibyte
> > buffer, isn't it?
>
> > I've seen \201's in the echo area before (BBDB) but never elsewhere,
> > this might be the same issue. I've no idea how to debug this.
>
> Wait a second. I see `1234\201ßABCD' from the first case and
> `1234ßABCD' from the second in both Emacs 20.7 and 21.0.90. Did you
> see EXACTLY?

Now I don't get that behaviour, but I'm sure I did last time.
Interesting.

Ok, now I get it again. All I did was to do something else for a
while. I'll try to narrow it down to specific commands. I first
thought it was the backlog, but it doesn't seem to be.

First case: `1234\201\201ßABCD' Second case: `1234\201ßABCD'. In echo
area. A freshly started emacs/gnus display one less \201 in both
cases. In the *scratch* buffer I always seem to get `1234\201ßABCD'
and `1234ßABCD' respectively.

Emacs 20.7.

* Re: imap breaks latin-1 characters
  2000-09-19  9:27 ` Simon Josefsson
@ 2000-09-19 12:43 ` ShengHuo ZHU
  2000-09-21 23:13   ` Dave Love
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-19 12:43 UTC

Simon Josefsson <simon@josefsson.org> writes:

[...]

> Or is there a (string-as-multibyte-foo "1234\201\337ABCD") that return
> a multibyte string that will display as 1234\201ßABCD? Isn't that
> what we need?

string-as-multibyte is something acting like type casting instead of
literally converting. We need something convert it to 8bit multibyte
string "1234\201\337ABCD", not "1234\201ßABCD" which contains
latin-iso8859-1.

[...]

> First case: `1234\201\201ßABCD' Second case: `1234\201ßABCD'. In echo
> area. A freshly started emacs/gnus display one less \201 in both
> cases. In the *scratch* buffer I always seem to get `1234\201ßABCD'
> and `1234ßABCD' respectively.

There could be a bug somewhere.

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-19 12:43 ` ShengHuo ZHU
@ 2000-09-21 23:13 ` Dave Love
  2000-09-22  0:18   ` ShengHuo ZHU
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Love @ 2000-09-21 23:13 UTC

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> We need something convert it to 8bit multibyte string
 ZSH> "1234\201\337ABCD", not "1234\201ßABCD" which contains
 ZSH> latin-iso8859-1.

I don't understand this. For raw bytes, you're talking unibyte
(raw-text/binary) and I'd have thought that was what you'd want to
convert _from_.

I'm quite confused by this thread. I'm not sure I ever really
followed what happens from the server through to display, but it did
seem convoluted when I looked. I don't have time to go through it now
either.

We should be able to get Handa to advise, though, especially with Gnus
5.8 in the Emacs sources now. I guess he will be interested in things
that stress the Mule features.

Can you sketch what happens in Gnus, what the problems are exactly and
what features you think are needed to avoid them? I think it's too
late for new features in Mule 5.0, though.

* Re: imap breaks latin-1 characters
  2000-09-21 23:13 ` Dave Love
@ 2000-09-22  0:18 ` ShengHuo ZHU
  2000-09-22 14:56   ` Dave Love
  2000-09-26  2:39   ` Kenichi Handa
  0 siblings, 2 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22 0:18 UTC
Cc: Kenichi Handa

Dave Love <d.love@dl.ac.uk> writes:

> >>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:
>
> ZSH> We need something convert it to 8bit multibyte string
> ZSH> "1234\201\337ABCD", not "1234\201ßABCD" which contains
> ZSH> latin-iso8859-1.
>
> I don't understand this. For raw bytes, you're talking unibyte
> (raw-text/binary) and I'd have thought that was what you'd want to
> convert _from_.
>
> I'm quite confused by this thread. I'm not sure I ever really
> followed what happens from the server through to display, but it did
> seem convoluted when I looked. I don't have time to go through it now
> either.
>
> We should be able to get Handa to advise, though, especially with Gnus
> 5.8 in the Emacs sources now. I guess he will be interested in things
> that stress the Mule features.
>
> Can you sketch what happens in Gnus, what the problems are exactly and
> what features you think are needed to avoid them? I think it's too
> late for new features in Mule 5.0, though.

The problems discussed are handling unibyte string or buffer. Unibyte
buffer was introduced in Gnus, partially because early Emacs 20 could
not handle 8bit data properly. Anyway, unibyte buffers and strings
are used in Gnus 5.8 and unlikely going to be removed before Gnus 5.9
is released.

I found the most of these problems are related to
unibyte-char-to-multibyte or so. For example,

(char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,

which means 8bit unibyte characters (\240-\377) are converted to
latin-iso8859-1 characters instead of eight-bit-graphic ones (see
DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source). I guess this
setting is because of the compatibility.

Now, suppose to insert an encoded (unibyte) string (maybe from some
unibyte buffer) into a multibyte buffer, then decode it. The string
is garbled after inserting the buffer. For example, you may get
different results from these two examples (with Mule-UCS), even in the
current Emacs 21.0.90.

(decode-coding-string "\346\226\207" 'utf-8)

(with-temp-buffer
  (insert "\346\226\207")
  (decode-coding-region (point-min) (point-max) 'utf-8)
  (buffer-string))

Another pair of examples, which results a "\201".

(decode-coding-string "\337" 'iso-8859-1)

(with-temp-buffer
  (insert "\337")
  (decode-coding-region (point-min) (point-max) 'iso-8859-1)
  (buffer-string))

Or

(decode-coding-string "\244\244" 'big5)

(with-temp-buffer
  (insert "\244\244")
  (decode-coding-region (point-min) (point-max) 'big5)
  (buffer-string))

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-22  0:18 ` ShengHuo ZHU
@ 2000-09-22 14:56 ` Dave Love
  2000-09-22 16:34   ` Kai Großjohann
  2000-09-22 17:44   ` ShengHuo ZHU
  2000-09-26  2:39 ` Kenichi Handa
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Love @ 2000-09-22 14:56 UTC

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> The problems discussed are handling unibyte string or buffer.
 ZSH> Unibyte buffer was introduced in Gnus, partially because early
 ZSH> Emacs 20 could not handle 8bit data properly.

So this is just a problem with the Gnus code?

 ZSH> which means 8bit unibyte characters (\240-\377) are converted to
 ZSH> latin-iso8859-1 characters instead of eight-bit-graphic ones
 ZSH> (see DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).

Presumably you can bind `nonascii-translation-table' if necessary. It
seems reasonable to be able to use a zero `nonascii-insert-offset'
now. Feel free to suggest it if that will solve a problem.

 ZSH> Now, suppose to insert an encoded (unibyte) string (maybe from
 ZSH> some unibyte buffer) into a multibyte buffer, then decode it.

Decoding in a unibyte buffer seems a better idea if it needs to be
done explicitly.

* Re: imap breaks latin-1 characters
  2000-09-22 14:56 ` Dave Love
@ 2000-09-22 16:34 ` Kai Großjohann
  2000-09-22 17:55   ` ShengHuo ZHU
  2000-09-25 11:23   ` Dave Love
  2000-09-22 17:44 ` ShengHuo ZHU
  1 sibling, 2 replies; 18+ messages in thread
From: Kai Großjohann @ 2000-09-22 16:34 UTC
Cc: ding

On Fri, 22 Sep 2000, Dave Love wrote:

>>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:
>
> ZSH> Now, suppose to insert an encoded (unibyte) string (maybe from
> ZSH> some unibyte buffer) into a multibyte buffer, then decode it.
>
> Decoding in a unibyte buffer seems a better idea if it needs to be
> done explicitly.

What if the result of decoding is multibyte? Does that automagically
convert the unibyte buffer to multibyte?

IIUC, the IMAP server sends a number of arbitrary bytes. Gnus needs
to (temporarily) store them without any modification. Then Gnus finds
out what is the encoding of those bytes and decodes the bytes
according to the encoding. (Can it happen that the IMAP server now
sends a byte sequence which is the result of the foo encoding, and in
a few minutes it will send a byte sequence which is the result of a
different encoding? I suppose that this can happen, but I'm not
sure. For example, the IMAP folder names could be iso-8859-1 encoded
but the message content is ShiftJIS or whatever.)

Also IIUC, a multibyte buffer cannot hold any arbitrary byte
sequence without modification, so the buffer that receives the bytes
from the IMAP server is a unibyte buffer. So, how do we get from
the byte sequence to the decoded character sequence?

Always eager to learn more about Mule,
kai
-- 
I like BOTH kinds of music.

* Re: imap breaks latin-1 characters
  2000-09-22 16:34 ` Kai Großjohann
@ 2000-09-22 17:55 ` ShengHuo ZHU
  2000-09-25 11:23 ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22 17:55 UTC

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Also IIUC, a multibyte buffer cannot hold any arbitrary byte
> sequence without modification, so the buffer that receives the bytes
> from the IMAP server is a unibyte buffer. So, how do we get from
> the byte sequence to the decoded character sequence?

A multibyte buffer in Emacs 21.0.90 (Mule 5.0) can hold. But Emacs
20.7 or earlier can't. Try (insert ?\201 ?\337). Therefore, we can
not expect Gnus doing this without using unibyte buffers before Emacs
21 is prevalent.

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-22 16:34 ` Kai Großjohann
  2000-09-22 17:55 ` ShengHuo ZHU
@ 2000-09-25 11:23 ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-25 11:23 UTC

>>>>> "KG" == Kai Großjohann <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:

 KG> What if the result of decoding is multibyte?

??

(let (default-enable-multibyte-characters)
  (multibyte-string-p (decode-coding-string "a\255" 'latin-1)))
  => t

 KG> Also IIUC, a multibyte buffer cannot hold any arbitrary byte
 KG> sequence without modification,

That's clearly not true. The issue is how you interpret the bytes as
characters.

 KG> so the buffer that receives the bytes from the IMAP server is a
 KG> unibyte buffer. So, how do we get from the byte sequence to the
 KG> decoded character sequence?

With `decode-coding-region'. If you want to decode it with a coding
system which encodes all octets, that should work OK even in Emacs 20
(e.g. codepage.el).

 KG> Always eager to learn more about Mule,

If you find the doc (plus examples in the Emacs source of what you
want to do) deficient, _please_ report it. Mule wizards presumably
hang out on the English Mule list, but I'm not one of them and I can't
spend time on things which aren't going to improve the release, sorry.

* Re: imap breaks latin-1 characters
  2000-09-22 14:56 ` Dave Love
  2000-09-22 16:34 ` Kai Großjohann
@ 2000-09-22 17:44 ` ShengHuo ZHU
  2000-09-25 11:25   ` Dave Love
  1 sibling, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22 17:44 UTC

Dave Love <d.love@dl.ac.uk> writes:

> ZSH> which means 8bit unibyte characters (\240-\377) are converted to
> ZSH> latin-iso8859-1 characters instead of eight-bit-graphic ones
> ZSH> (see DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).
>
> Presumably you can bind `nonascii-translation-table' if necessary. It
> seems reasonable to be able to use a zero `nonascii-insert-offset'
> now. Feel free to suggest it if that will solve a problem.

Without modifying unibyte_char_to_multibyte, binding
`nonascii-translation-table' would not work for this. I think that
enabling a zero `nonascii-insert-offset' is a reasonable solution.
Anyway, to get this done, unibyte_char_to_multibyte have to be
changed.

> ZSH> Now, suppose to insert an encoded (unibyte) string (maybe from
> ZSH> some unibyte buffer) into a multibyte buffer, then decode it.
>
> Decoding in a unibyte buffer seems a better idea if it needs to be
> done explicitly.

Multibyte strings (the result) could not exist in a unibyte buffer,
unless to force the buffer to be multibyte one after decoding. For
example:

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\337")
  (decode-coding-region (point-min) (point-max) 'iso-8859-1)
  (set-buffer-multibyte t)
  (buffer-string))

Is this your idea?

ShengHuo

* Re: imap breaks latin-1 characters
  2000-09-22 17:44 ` ShengHuo ZHU
@ 2000-09-25 11:25 ` Dave Love
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-25 11:25 UTC

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> Without modifying unibyte_char_to_multibyte, binding
 ZSH> `nonascii-translation-table' would not work for this.

In what way does it not work?

[...]

 ZSH> Is this your idea?

Yes, you should normally frob the buffer's multibyteness for coding
conversion. I think there are examples in the Emacs source,
e.g. tar-mode IIRC; probably also Rmail.

* Re: imap breaks latin-1 characters
  2000-09-22  0:18 ` ShengHuo ZHU
  2000-09-22 14:56 ` Dave Love
@ 2000-09-26  2:39 ` Kenichi Handa
  1 sibling, 0 replies; 18+ messages in thread
From: Kenichi Handa @ 2000-09-26 2:39 UTC
Cc: ding, d.love, handa

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> Dave Love <d.love@dl.ac.uk> writes:
>> Can you sketch what happens in Gnus, what the problems are exactly and
>> what features you think are needed to avoid them? I think it's too
>> late for new features in Mule 5.0, though.

> The problems discussed are handling unibyte string or buffer. Unibyte
> buffer was introduced in Gnus, partially because early Emacs 20 could
> not handle 8bit data properly. Anyway, unibyte buffers and strings
> are used in Gnus 5.8 and unlikely going to be removed before Gnus 5.9
> is released.

> I found the most of these problems are related to
> unibyte-char-to-multibyte or so. For example,

> (char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,

> which means 8bit unibyte characters (\240-\377) are converted to
> latin-iso8859-1 characters instead of eight-bit-graphic ones (see
> DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source). I guess this
> setting is because of the compatibility.

> Now, suppose to insert an encoded (unibyte) string (maybe from some
> unibyte buffer) into a multibyte buffer, then decode it. The string
> is garbled after inserting the buffer. For example, you may get
> different results from these two examples (with Mule-UCS), even in the
> current Emacs 21.0.90.

> (decode-coding-string "\346\226\207" 'utf-8)

> (with-temp-buffer
>   (insert "\346\226\207")
>   (decode-coding-region (point-min) (point-max) 'utf-8)
>   (buffer-string))

> Another pair of examples, which results a "\201".

> (decode-coding-string "\337" 'iso-8859-1)

> (with-temp-buffer
>   (insert "\337")
>   (decode-coding-region (point-min) (point-max) 'iso-8859-1)
>   (buffer-string))

> Or

> (decode-coding-string "\244\244" 'big5)

> (with-temp-buffer
>   (insert "\244\244")
>   (decode-coding-region (point-min) (point-max) 'big5)
>   (buffer-string))

I agree that Emacs Lisp programmers face annoying problem in such a
case. The main reason I think is that we can not mix multibyte region
and unibyte region in a single buffer. Thus, although
decode-coding-string converts unibyte string to multibyte string and
encode-coding-string converts multibyte string to unibyte string,
decode/encode-coding-region doesn't change the multibyteness of the
region. Programmers should pay attention to multibyteness explicitly.

In your example, we must write as below to get the same result as
decode-coding-string.

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\244\244")
  (decode-coding-region (point-min) (point-max) 'big5)
  (set-buffer-multibyte t)
  (buffer-string))

The above simulates what decode-coding-string does.

Another way is to use string-as-multibyte as below:

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string-as-multibyte "\244\244"))
  (decode-coding-region (point-min) (point-max) 'big5)
  (buffer-string))

---
Ken'ichi HANDA
handa@etl.go.jp
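
A minimal sketch, assuming the Emacs 20.7 / 21.0.90 behaviour described
in this thread (it has not been checked against later releases): the
first workaround above, wrapped as a function. The name
`my-decode-bytes' is hypothetical and is not part of Gnus or Emacs; only
the calls already shown in the thread are used.

```elisp
;; Hypothetical helper (assumption: not an existing Gnus/Emacs function).
(defun my-decode-bytes (bytes coding-system)
  "Decode the unibyte string BYTES with CODING-SYSTEM via a temp buffer.
Uses the set-buffer-multibyte sequence shown in the thread."
  (with-temp-buffer
    ;; Make the buffer unibyte first, so that inserting BYTES does not
    ;; convert \240-\377 into latin-iso8859-1 characters.
    (set-buffer-multibyte nil)
    (insert bytes)
    (decode-coding-region (point-min) (point-max) coding-system)
    ;; Switch back so the decoded contents are read as multibyte text.
    (set-buffer-multibyte t)
    (buffer-string)))

;; Same input as the big5 example in the thread.
(my-decode-bytes "\244\244" 'big5)
```

Where no intermediate buffer is needed, calling decode-coding-string
directly on the unibyte string is the simpler route, as noted earlier in
the thread.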

* Re: imap breaks latin-1 characters
  2000-09-18 22:34 ` ShengHuo ZHU
  2000-09-19  9:27 ` Simon Josefsson
@ 2000-09-21 23:10 ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-21 23:10 UTC

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> It is probably a feature instead of a bug.

The behaviour of string-as-multibyte is certainly not a bug, just what
it does.

 ZSH> But the document "contains an individual 8-bit byte (i.e. not
 ZSH> part of multibyte form)" is confusing.

Do you mean part of the manual is confusing? If so, please make a bug
report, though it sounds like text that won't be in the current
manual. It's clear the Mule documentation could be better, but it
really needs feedback.

 >> I've seen \201's in the echo area before (BBDB) but never elsewhere,
 >> this might be the same issue. I've no idea how to debug this.

BBDB doesn't consider multilingual text. In file-coding-system-alist
I have:

  ("\\.bbdb\\'" . emacs-mule)

That definitely helps with multilingual entries, but may well not be
sufficient.

* Re: imap breaks latin-1 characters
  2000-09-18 20:31 ` Simon Josefsson
  2000-09-18 22:34 ` ShengHuo ZHU
@ 2000-09-18 22:42 ` Kai Großjohann
  1 sibling, 0 replies; 18+ messages in thread
From: Kai Großjohann @ 2000-09-18 22:42 UTC

Hm. I didn't fully grok the discussion, but wasn't the starting point
to parse some output from the IMAP server?

How can we tell what encoding is used by the IMAP server? Is the
encoding used by the IMAP server always the same, or is it possible
that it prints group names in Latin-2 (say) but message contents in
Latin-1 (say)?

If we can be sure that the output from the IMAP server is always in
the same coding system, then the most natural approach would be to
provide a variable for this, defaulting to process-coding-system,
perhaps, and to use this coding system to decode output from the IMAP
server, no?

But probably, I'm missing the whole point, since I didn't really
understand what you were talking about...

kai
-- 
I like BOTH kinds of music.