From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/32587
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@etl.go.jp>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: imap breaks latin-1 characters
Date: Tue, 26 Sep 2000 11:39:24 +0900 (JST)
Sender: owner-ding@hpc.uh.edu
Message-ID: <200009260239.LAA17251@etlken.etl.go.jp>
References: <87vgvu81n4.fsf@gnu.org> <200009181322.e8IDMYg03611@zsh.2y.net>
	<iluvgvtn673.fsf@barbar.josefsson.org>
	<200009181517.e8IFHV503937@zsh.2y.net>
	<ilur96h5xen.fsf@barbar.josefsson.org>
	<200009181958.e8IJwRT06371@zsh.2y.net>
	<iluu2bdjvbw.fsf@barbar.josefsson.org>
	<200009182234.e8IMY0R07025@zsh.2y.net>
	<iluwvg868ap.fsf@barbar.josefsson.org>
	<200009191243.e8JChFg09912@zsh.2y.net>
	<200009212313.AAA08518@djlvig.dl.ac.uk> <200009220018.e8M0IKR24299@zsh.2y.net>
NNTP-Posting-Host: coloc-standby.netfonds.no
X-Trace: main.gmane.org 1035168849 20685 80.91.224.250 (21 Oct 2002 02:54:09 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Mon, 21 Oct 2002 02:54:09 +0000 (UTC)
Cc: ding@gnus.org, d.love@dl.ac.uk, handa@etl.go.jp
Return-Path: <owner-ding@hpc.uh.edu>
Original-Received: from fisher.math.uh.edu (fisher.math.uh.edu [129.7.128.35])
	by mailhost.sclp.com (Postfix) with ESMTP id B33AFD051E
	for <jason@mailhost.sclp.com>; Mon, 25 Sep 2000 22:44:59 -0400 (EDT)
Original-Received: from sina.hpc.uh.edu (lists@Sina.HPC.UH.EDU [129.7.3.5])
	by fisher.math.uh.edu (8.9.1/8.9.1) with ESMTP id VAC04096;
	Mon, 25 Sep 2000 21:41:14 -0500 (CDT)
Original-Received: by sina.hpc.uh.edu (TLB v0.09a (1.20 tibbs 1996/10/09 22:03:07)); Mon, 25 Sep 2000 21:40:31 -0500 (CDT)
Original-Received: from mailhost.sclp.com (postfix@66-209.196.61.interliant.com [209.196.61.66] (may be forged))
	by sina.hpc.uh.edu (8.9.3/8.9.3) with ESMTP id VAA21146
	for <ding@hpc.uh.edu>; Mon, 25 Sep 2000 21:40:19 -0500 (CDT)
Original-Received: from etlmail2.etl.go.jp (etlmail2.etl.go.jp [192.50.105.3])
	by mailhost.sclp.com (Postfix) with ESMTP id B49A5D051E
	for <ding@gnus.org>; Mon, 25 Sep 2000 22:40:37 -0400 (EDT)
Original-Received: from etlken.etl.go.jp (etlken.etl.go.jp [192.50.73.50])
	by etlmail2.etl.go.jp (8.10.1/3.7W-2000022817) with ESMTP id e8Q2dOO23880;
	Tue, 26 Sep 2000 11:39:24 +0900 (JST)
	(envelope-from handa@etl.go.jp)
Original-Received: (from handa@localhost)
	by etlken.etl.go.jp (8.8.8+Sun/3.7W-1999101307) id LAA17251;
	Tue, 26 Sep 2000 11:39:24 +0900 (JST)
X-Authentication-Warning: etlken.etl.go.jp: handa set sender to handa@etl.go.jp using -f
Original-To: zsh@cs.rochester.edu
In-reply-to: <200009220018.e8M0IKR24299@zsh.2y.net> (message from ShengHuo ZHU
	on Thu, 21 Sep 2000 20:18:20 -0400)
Precedence: list
X-Majordomo: 1.94.jlt7
Xref: main.gmane.org gmane.emacs.gnus.general:32587
X-Report-Spam: http://spam.gmane.org/gmane.emacs.gnus.general:32587

ShengHuo ZHU <zsh@cs.rochester.edu> writes:
> Dave Love <d.love@dl.ac.uk> writes:
>>  Can you sketch what happens in Gnus, what the problems are exactly and
>>  what features you think are needed to avoid them?  I think it's too
>>  late for new features in Mule 5.0, though.

> The problems discussed are handling unibyte string or buffer.  Unibyte
> buffer was introduced in Gnus, partially because early Emacs 20 could
> not handle 8bit data properly.  Anyway, unibyte buffers and strings
> are used in Gnus 5.8 and unlikely going to be removed before Gnus 5.9
> is released.

> I found the most of these problems are related to
> unibyte-char-to-multibyte or so.  For example,

>   (char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,

> which means 8bit unibyte characters (\240-\377) are converted to
> latin-iso8859-1 characters instead of eight-bit-graphic ones (see
> DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).  I guess this
> setting is because of the compatibility.

> Now, suppose to insert an encoded (unibyte) string (maybe from some
> unibyte buffer) into a multibyte buffer, then decode it.  The string
> is garbled after inserting the buffer.  For example, you may get
> different results from these two examples (with Mule-UCS), even in the
> current Emacs 21.0.90.

> (decode-coding-string "\346\226\207" 'utf-8)

> (with-temp-buffer
>     (insert "\346\226\207")
>     (decode-coding-region (point-min) (point-max) 'utf-8)
>     (buffer-string))

> Another pair of examples, which results a "\201".

> (decode-coding-string "\337" 'iso-8859-1)

> (with-temp-buffer
>     (insert "\337")
>     (decode-coding-region (point-min) (point-max) 'iso-8859-1)
>     (buffer-string))

> Or 

> (decode-coding-string "\244\244" 'big5)

> (with-temp-buffer
>     (insert "\244\244")
>     (decode-coding-region (point-min) (point-max) 'big5)
>     (buffer-string))

I agree that Emacs Lisp programmers face an annoying problem
in such a case.  I think the main reason is that we cannot
mix multibyte and unibyte regions in a single buffer.
Thus, although decode-coding-string converts a unibyte string
to a multibyte string and encode-coding-string converts a
multibyte string to a unibyte string, decode-coding-region
and encode-coding-region don't change the multibyteness of
the region.  Programmers should pay attention to
multibyteness explicitly.  In your example, we must write the
following to get the same result as decode-coding-string.

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\244\244")
  (decode-coding-region (point-min) (point-max) 'big5)
  (set-buffer-multibyte t)
  (buffer-string))

The above simulates what decode-coding-string does.  Another
way is to use string-as-multibyte as below:

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string-as-multibyte "\244\244"))
  (decode-coding-region (point-min) (point-max) 'big5)
  (buffer-string))
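
To see the multibyteness of each string involved, a small check
like the following sketch may help.  The function name
sketch-multibyteness-report is hypothetical and only for
illustration; multibyte-string-p, decode-coding-string and
encode-coding-string are the standard functions.  It assumes an
Emacs in which the big5 coding system is available.

```elisp
;; Hypothetical helper, for illustration only.
;; Returns a list of four flags:
;;   1. whether the raw input string is multibyte
;;   2. whether the decoded string is multibyte
;;   3. whether the re-encoded string is multibyte
;;   4. whether re-encoding gives back the raw bytes
(defun sketch-multibyteness-report ()
  (let* ((raw "\244\244")                        ; octal escapes give a unibyte string
         (decoded (decode-coding-string raw 'big5))
         (encoded (encode-coding-string decoded 'big5)))
    (list (multibyte-string-p raw)
          (multibyte-string-p decoded)
          (multibyte-string-p encoded)
          (string= raw encoded))))
```

If the string functions behave as described above, only the
second flag (the decoded string) should report multibyte, and
the last flag should report that the round trip is lossless.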

---
Ken'ichi HANDA
handa@etl.go.jp