From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kenichi Handa
Newsgroups: gmane.emacs.gnus.general
Subject: Re: imap breaks latin-1 characters
Date: Tue, 26 Sep 2000 11:39:24 +0900 (JST)
Sender: owner-ding@hpc.uh.edu
Message-ID: <200009260239.LAA17251@etlken.etl.go.jp>
References: <87vgvu81n4.fsf@gnu.org>
	<200009181322.e8IDMYg03611@zsh.2y.net>
	<200009181517.e8IFHV503937@zsh.2y.net>
	<200009181958.e8IJwRT06371@zsh.2y.net>
	<200009182234.e8IMY0R07025@zsh.2y.net>
	<200009191243.e8JChFg09912@zsh.2y.net>
	<200009212313.AAA08518@djlvig.dl.ac.uk>
	<200009220018.e8M0IKR24299@zsh.2y.net>
Cc: ding@gnus.org, d.love@dl.ac.uk, handa@etl.go.jp
Original-To: zsh@cs.rochester.edu
In-reply-to: <200009220018.e8M0IKR24299@zsh.2y.net> (message from ShengHuo ZHU on Thu, 21 Sep 2000 20:18:20 -0400)

ShengHuo ZHU writes:

> Dave Love writes:
>> Can you sketch what happens in Gnus, what the problems are exactly and
>> what features you think are needed to avoid them? I think it's too
>> late for new features in Mule 5.0, though.

> The problems discussed are handling unibyte string or buffer. Unibyte
> buffer was introduced in Gnus, partially because early Emacs 20 could
> not handle 8bit data properly. Anyway, unibyte buffers and strings
> are used in Gnus 5.8 and unlikely going to be removed before Gnus 5.9
> is released.

> I found the most of these problems are related to
> unibyte-char-to-multibyte or so. For example,
> (char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,
> which means 8bit unibyte characters (\240-\377) are converted to
> latin-iso8859-1 characters instead of eight-bit-graphic ones (see
> DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source). I guess this
> setting is because of the compatibility.

> Now, suppose to insert an encoded (unibyte) string (maybe from some
> unibyte buffer) into a multibyte buffer, then decode it. The string
> is garbled after inserting the buffer. For example, you may get
> different results from these two examples (with Mule-UCS), even in the
> current Emacs 21.0.90.

> (decode-coding-string "\346\226\207" 'utf-8)

> (with-temp-buffer
>   (insert "\346\226\207")
>   (decode-coding-region (point-min) (point-max) 'utf-8)
>   (buffer-string))

> Another pair of examples, which results a "\201".

> (decode-coding-string "\337" 'iso-8859-1)

> (with-temp-buffer
>   (insert "\337")
>   (decode-coding-region (point-min) (point-max) 'iso-8859-1)
>   (buffer-string))

> Or

> (decode-coding-string "\244\244" 'big5)

> (with-temp-buffer
>   (insert "\244\244")
>   (decode-coding-region (point-min) (point-max) 'big5)
>   (buffer-string))

I agree that Emacs Lisp programmers face an annoying problem in such a
case. The main reason, I think, is that we cannot mix multibyte and
unibyte regions in a single buffer. Thus, although
decode-coding-string converts a unibyte string to a multibyte string
and encode-coding-string converts a multibyte string to a unibyte
string, decode-coding-region and encode-coding-region don't change the
multibyteness of the region. Programmers must pay attention to
multibyteness explicitly.

In your example, we must write the following to get the same result as
decode-coding-string:

(with-temp-buffer
  ;; Make the buffer unibyte so the encoded bytes are inserted as is.
  (set-buffer-multibyte nil)
  (insert "\244\244")
  (decode-coding-region (point-min) (point-max) 'big5)
  ;; Switch back to multibyte before extracting the decoded text.
  (set-buffer-multibyte t)
  (buffer-string))

The above simulates what decode-coding-string does.

Another way is to use string-as-multibyte, as below:

(with-temp-buffer
  (set-buffer-multibyte t)
  ;; Same bytes, viewed as a multibyte string.
  (insert (string-as-multibyte "\244\244"))
  (decode-coding-region (point-min) (point-max) 'big5)
  (buffer-string))

---
Ken'ichi HANDA
handa@etl.go.jp