Gnus development mailing list
* Re: `.newsrc.eld' saves chinese group name in wrong coding
       [not found] <ufydnay6j.fsf@gmail.com>
@ 2006-10-19  2:54 ` Chong Yidong
  2006-10-19  3:56   ` Katsumi Yamaoka
  0 siblings, 1 reply; 45+ messages in thread
From: Chong Yidong @ 2006-10-19  2:54 UTC (permalink / raw)
  Cc: Zhang Wei, ding, Kenichi Handa

Zhang Wei <id.brep@gmail.com> writes:

> `.newsrc.eld' can't save chinese group name in proper coding. When gnus
> is restarted, all of the articles in groups with chinese name are marked
> unread. But enter that group, you will find all of the articles are old
> articles (marked by an `O'). The file in the attachment is the wrong
> formatted `.newsrc.eld', hope that will be helpful.

The problem seems to be that when a Chinese group name is given, e.g.
"好", `gnus-group-insert-group-line' ends up calling

  (decode-coding-string "好" 'utf-8)

which gives gibberish.  Could either the coding systems experts
(i.e. Handa) or Gnus experts tell us why this is the wrong thing to
do?
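
A minimal sketch of what decoding is meant to operate on, assuming an
Emacs with utf-8 support.  The local names are illustrative and this is
not the code path Gnus itself runs.  `encode-coding-string' turns
characters into a unibyte string of bytes, and `decode-coding-string'
is only meaningful when applied to such bytes:

```elisp
;; Illustrative only: NAME and BYTES are hypothetical local names.
(let* ((name "好")                                  ; decoded text, multibyte
       (bytes (encode-coding-string name 'utf-8)))  ; raw utf-8 bytes, unibyte
  (list (multibyte-string-p name)                   ; t
        (multibyte-string-p bytes)                  ; nil
        ;; Decoding the bytes gives back the original character.
        (string= (decode-coding-string bytes 'utf-8) name)))  ; t
```

Calling `decode-coding-string' on `name' itself, which is already
decoded text, is the situation described above.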

I think the way to reproduce this is as follows:

1. save an empty file with a Chinese filename:
   C-x C-f 好 RET RET C-x C-s

   (I simply copied the character into the minibuffer from the HELLO
   file.)

2. go to the Gnus group buffer:
   M-x gnus RET

3. Open that file as a Gnus group:
   G f

=> Gnus group line is shown in Gibberish

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  2:54 ` `.newsrc.eld' saves chinese group name in wrong coding Chong Yidong
@ 2006-10-19  3:56   ` Katsumi Yamaoka
  2006-10-19  4:11     ` Katsumi Yamaoka
  0 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-19  3:56 UTC (permalink / raw)
  Cc: emacs-pretest-bug, Zhang Wei, ding, Kenichi Handa

>>>>> In <87mz7tm2wn.fsf@furball.mit.edu>
>>>>>	Chong Yidong <cyd@stupidchicken.com> wrote:

> Zhang Wei <id.brep@gmail.com> writes:

>> `.newsrc.eld' can't save chinese group name in proper coding. When gnus
>> is restarted, all of the articles in groups with chinese name are marked
>> unread. But enter that group, you will find all of the articles are old
>> articles (marked by an `O'). The file in the attachment is the wrong
>> formatted `.newsrc.eld', hope that will be helpful.

> The problem seems to be that when a Chinese group name is given, e.g.
> "好", `gnus-group-insert-group-line' ends up calling

>   (decode-coding-string "好" 'utf-8)

> which gives gibberish.  Could either the coding systems experts
> (i.e. Handa) or Gnus experts tell us why this is the wrong thing to
> do?

Gnus uses utf-8 encoded non-ASCII group names internally, those
encoded names are saved in the .newsrc.eld file, and they are
decoded by utf-8 when displaying.  I had no problem when I once
tried nnrss groups with Japanese names.  So, I cannot imagine
what is happening with Zhang Wei, sorry.

> I think the way to reproduce this is as follows:

> 1. save an empty file with a Chinese filename:
>    C-x C-f 好 RET RET C-x C-s

>    (I simply copied the character into the minibuffer from the HELLO
>    file.)

> 2. go to the Gnus group buffer:
>    M-x gnus RET

> 3. Open that file as a Gnus group:
>    G f

> => Gnus group line is shown in Gibberish

It is caused by the default value of
`gnus-group-name-charset-group-alist'.  It can be fixed with the
following:

(push '("\\`nndoc\\(\\+?[^:]+\\)?:")
      gnus-group-name-charset-group-alist)

However, I'm not quite sure making it the new default is
generally good.

Regards,


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  3:56   ` Katsumi Yamaoka
@ 2006-10-19  4:11     ` Katsumi Yamaoka
  2006-10-19  8:33       ` Reiner Steib
  0 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-19  4:11 UTC (permalink / raw)
  Cc: emacs-pretest-bug, Zhang Wei, ding, Kenichi Handa

>>>>> In <b4m4pu1m00i.fsf@jpl.org> Katsumi Yamaoka wrote:

> It can be fixed with the following:

> (push '("\\`nndoc\\(\\+?[^:]+\\)?:")
>       gnus-group-name-charset-group-alist)

I mistyped it.  Here's what I wanted to write.

(push '("\\`nndoc\\(?:\\+[^:]+\\)?:")
      gnus-group-name-charset-group-alist)
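
A small sketch of how the two regexps differ, assuming group names of
the form "backend:group" or "backend+server:group".  The sample names
and the variable names are made up for illustration:

```elisp
;; The two regexps from the messages above, bound to hypothetical names.
(let ((mistyped "\\`nndoc\\(\\+?[^:]+\\)?:")
      (intended "\\`nndoc\\(?:\\+[^:]+\\)?:"))
  (mapcar (lambda (group)
            (list group
                  (and (string-match-p mistyped group) t)
                  (and (string-match-p intended group) t)))
          '("nndoc:好"           ; plain nndoc group: both match
            "nndoc+server:好"    ; nndoc with a "+server" part: both match
            "nndocfoo:好")))     ; not nndoc: only the mistyped one matches
```

In the mistyped version the "+" is optional, so any backend whose name
merely starts with "nndoc" would also match.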

In addition, just now I noticed it is insufficient to solve the
problem.  Maybe we need to do the fix here and there in Gnus to
enable it to work with non-ASCII nndoc group names.

Regards,


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  4:11     ` Katsumi Yamaoka
@ 2006-10-19  8:33       ` Reiner Steib
  2006-10-19  9:03         ` Katsumi Yamaoka
  0 siblings, 1 reply; 45+ messages in thread
From: Reiner Steib @ 2006-10-19  8:33 UTC (permalink / raw)
  Cc: Zhang Wei

On Thu, Oct 19 2006, Katsumi Yamaoka wrote:

> Gnus uses utf-8 encoded non-ASCII group names internally, those
> encoded names are saved in the .newsrc.eld file, and they are
> decoded by utf-8 when displaying.  I had no problem when I once
> tried nnrss groups with Japanese names.  So, I cannot imagine
> what is happening with Zhang Wei, sorry.
[...]
> (push '("\\`nndoc\\(?:\\+[^:]+\\)?:")
>       gnus-group-name-charset-group-alist)
>
> In addition, just now I noticed it is insufficient to solve the
> problem.  Maybe we need to do the fix here and there in Gnus to
> enable it to work with non-ASCII nndoc group names.

The default value of `gnus-group-name-charset-group-alist' is ((".*"
. utf-8)), so it should cover all groups, IIUC.  Or am I
misunderstanding the issue?

Why is setting it to nil for nndoc necessary?  Is nndoc handled
differently than other backends?

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  8:33       ` Reiner Steib
@ 2006-10-19  9:03         ` Katsumi Yamaoka
  2006-10-20  3:39           ` Chong Yidong
  2006-10-20  6:04           ` Eli Zaretskii
  0 siblings, 2 replies; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-19  9:03 UTC (permalink / raw)
  Cc: ding, Zhang Wei

>>>>> In <v9slhkn1qs.fsf@marauder.physik.uni-ulm.de>
>>>>>	Reiner Steib wrote:

> On Thu, Oct 19 2006, Katsumi Yamaoka wrote:

>> Gnus uses utf-8 encoded non-ASCII group names internally, those
>> encoded names are saved in the .newsrc.eld file, and they are
>> decoded by utf-8 when displaying.  I had no problem when I once
>> tried nnrss groups with Japanese names.  So, I cannot imagine
>> what is happening with Zhang Wei, sorry.
> [...]
>> (push '("\\`nndoc\\(?:\\+[^:]+\\)?:")
>>       gnus-group-name-charset-group-alist)
>>
>> In addition, just now I noticed it is insufficient to solve the
>> problem.  Maybe we need to do the fix here and there in Gnus to
>> enable it to work with non-ASCII nndoc group names.

> The default value of `gnus-group-name-charset-group-alist' is ((".*"
> . utf-8)), so it should cover all groups, IIUC.  Or am I
> misunderstanding the issue?

> Why is setting it to nil for nndoc necessary?  Is nndoc handled
> differently than other backends?

I figured out a moment ago that that was wrong approach.  All
group names should be utf-8 encoded for the internal use in
Gnus, so the value ((".*" . utf-8)) is necessary and sufficient.
IIUC, the difference between nnrss and nndoc is that the former
encodes a non-ASCII group name first.
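
A rough sketch of the scheme described above, under the assumption that
the alist is searched by regexp and the matching charset is used to
decode the stored name for display.  `my-group-name-charset' is a
hypothetical helper, not a Gnus function, and Gnus's own lookup may
differ in detail:

```elisp
;; Hypothetical helper: first non-nil charset whose regexp matches GROUP.
(defun my-group-name-charset (group alist)
  "Return the charset paired with the first regexp in ALIST matching GROUP."
  (let (charset)
    (while (and alist (not charset))
      (when (string-match-p (car (car alist)) group)
        (setq charset (cdr (car alist))))
      (setq alist (cdr alist)))
    charset))

(let* ((alist '((".*" . utf-8)))                    ; the default quoted above
       (stored (encode-coding-string "好" 'utf-8))  ; name encoded before storing
       (charset (my-group-name-charset stored alist)))
  ;; Decoding for display recovers the readable name.
  (decode-coding-string stored charset))
```

This only works if the name was encoded before it was stored, which is
the step nnrss performs and nndoc, per the message above, did not.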

Regards,




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  9:03         ` Katsumi Yamaoka
@ 2006-10-20  3:39           ` Chong Yidong
  2006-10-20  4:06             ` Katsumi Yamaoka
  2006-10-20  6:04           ` Eli Zaretskii
  1 sibling, 1 reply; 45+ messages in thread
From: Chong Yidong @ 2006-10-20  3:39 UTC (permalink / raw)
  Cc: emacs-pretest-bug

Katsumi Yamaoka <yamaoka@jpl.org> writes:

> I figured out a moment ago that that was wrong approach.  All
> group names should be utf-8 encoded for the internal use in
> Gnus, so the value ((".*" . utf-8)) is necessary and sufficient.
> IIUC, the difference between nnrss and nndoc is that the former
> encodes a non-ASCII group name first.

So what needs to be changed?





* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  3:39           ` Chong Yidong
@ 2006-10-20  4:06             ` Katsumi Yamaoka
  2006-10-20  5:18               ` Katsumi Yamaoka
  0 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-20  4:06 UTC (permalink / raw)
  Cc: emacs-pretest-bug

>>>>> In <87u01zd5ba.fsf@furball.mit.edu> Chong Yidong wrote:
> Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> I figured out a moment ago that that was wrong approach.  All
>> group names should be utf-8 encoded for the internal use in
>> Gnus, so the value ((".*" . utf-8)) is necessary and sufficient.
>> IIUC, the difference between nnrss and nndoc is that the former
>> encodes a non-ASCII group name first.

> So what needs to be changed?

I'm now looking into it.  However, I think improving nndoc
might not help Zhang Wei because the problem appears to be caused by
the nntp group.  So I don't feel much urgency about it.

>>>>> In <ufydnay6j.fsf@gmail.com>
>>>>>	Zhang Wei <id.brep@gmail.com> wrote:

[...]

> (setq gnus-newsrc-alist '(("\301\367\320\30799.\261\276\265\330\262\342\312\324" 3 ((1 . 8)) ((seen (1 . 8))))...




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  4:06             ` Katsumi Yamaoka
@ 2006-10-20  5:18               ` Katsumi Yamaoka
  0 siblings, 0 replies; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-20  5:18 UTC (permalink / raw)
  Cc: emacs-pretest-bug

>>>>> In <b4mods7hbr3.fsf@jpl.org> Katsumi Yamaoka wrote:
>>>>>> In <87u01zd5ba.fsf@furball.mit.edu> Chong Yidong wrote:

>> So what needs to be changed?

> I'm now looking into it.  However, I think improving nndoc
> might not help Zhang Wei because the problem appears to be caused by
> the nntp group.  So I don't feel much urgency about it.

>>>>>> In <ufydnay6j.fsf@gmail.com>
>>>>>>	Zhang Wei <id.brep@gmail.com> wrote:

> [...]

>> (setq gnus-newsrc-alist '(("\301\367\320\30799.\261\276\265\330\262\342\312\324" 3 ((1 . 8)) ((seen (1 . 8))))...

The following patch enables Gnus to use non-ASCII names in nndoc
groups.  I've tested it with the "~/好" file containing mbox
data.  After some other tests, I will install it to the Gnus
trunk and the v5-10 branch.

I don't think it solves Zhang Wei's problem, though.  I'm
unable to test with nntp groups that have non-ASCII names, but IIRC,
Gnus was made to work with those groups a couple of
years ago (even if there might still be trivial difficulties).

--8<---------------cut here---------------start------------->8---
*** gnus-group.el~	Mon Jul 17 21:52:02 2006
--- gnus-group.el	Fri Oct 20 05:15:50 2006
***************
*** 2680,2692 ****
  			  (t (setq err (format "%c unknown. " char))
  			     nil))))
        (setq type found)))
!   (let* ((file (expand-file-name file))
! 	 (name (gnus-generate-new-group-name
! 		(gnus-group-prefixed-name
! 		 (file-name-nondirectory file) '(nndoc "")))))
      (gnus-group-make-group
!      (gnus-group-real-name name)
!      (list 'nndoc file
  	   (list 'nndoc-address file)
  	   (list 'nndoc-article-type (or type 'guess))))))
  
--- 2680,2697 ----
  			  (t (setq err (format "%c unknown. " char))
  			     nil))))
        (setq type found)))
!   (setq file (expand-file-name file))
!   (let ((name (gnus-generate-new-group-name
! 	       (gnus-group-prefixed-name
! 		(file-name-nondirectory file) '(nndoc ""))))
! 	(encodable (mm-coding-system-p 'utf-8)))
      (gnus-group-make-group
!      (if encodable
! 	 (mm-encode-coding-string (gnus-group-real-name name) 'utf-8)
!        (gnus-group-real-name name))
!      (list 'nndoc (if encodable
! 		      (mm-encode-coding-string file 'utf-8)
! 		    file)
  	   (list 'nndoc-address file)
  	   (list 'nndoc-article-type (or type 'guess))))))
  
--8<---------------cut here---------------end--------------->8---




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-19  9:03         ` Katsumi Yamaoka
  2006-10-20  3:39           ` Chong Yidong
@ 2006-10-20  6:04           ` Eli Zaretskii
  2006-10-20  6:21             ` Katsumi Yamaoka
  1 sibling, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-20  6:04 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

> Date: Thu, 19 Oct 2006 18:03:55 +0900
> From: Katsumi Yamaoka <yamaoka@jpl.org>
> Cc: Zhang Wei <id.brep@gmail.com>, ding@gnus.org
> 
> All group names should be utf-8 encoded for the internal use in Gnus

I don't know anything about Gnus, but is this sentence really right?
Gnus is part of Emacs, and Emacs normally doesn't use encoded strings
internally, it only encodes them when it writes them to a file or
sends them to a program.

Did you perhaps mean ``all group names should use characters from the
mule-unicode-* character set''?  That would make sense to me.




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  6:04           ` Eli Zaretskii
@ 2006-10-20  6:21             ` Katsumi Yamaoka
  2006-10-20  6:38               ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-20  6:21 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

>>>>> In <ubqo7r08q.fsf@gnu.org> Eli Zaretskii wrote:
>> Date: Thu, 19 Oct 2006 18:03:55 +0900
>> From: Katsumi Yamaoka <yamaoka@jpl.org>
>> Cc: Zhang Wei <id.brep@gmail.com>, ding@gnus.org
>>
>> All group names should be utf-8 encoded for the internal use in Gnus

> I don't know anything about Gnus, but is this sentence really right?
> Gnus is part of Emacs, and Emacs normally doesn't use encoded strings
> internally, it only encodes them when it writes them to a file or
> sends them to a program.

> Did you perhaps mean ``all group names should use characters from the
> mule-unicode-* character set''?  That would make sense to me.

No, Gnus uses `(encode-coding-string "name" 'utf-8)' as a
group name internally.

IIRC, nntp servers understand utf-8 encoded group names.  So,
someone might have considered making Gnus use them internally is
convenient to communicate with nntp servers.  I'm not quite sure
it is the best way, even if it was an easy way to enable Gnus to
use non-ASCII group names.




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  6:21             ` Katsumi Yamaoka
@ 2006-10-20  6:38               ` Eli Zaretskii
  2006-10-20  8:59                 ` Katsumi Yamaoka
                                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-20  6:38 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

> Date: Fri, 20 Oct 2006 15:21:53 +0900
> From: Katsumi Yamaoka <yamaoka@jpl.org>
> Cc: emacs-pretest-bug@gnu.org, id.brep@gmail.com, ding@gnus.org
> 
> IIRC, nntp servers understand utf-8 encoded group names.  So,
> someone might have considered making Gnus use them internally is
> convenient to communicate with nntp servers.

I'd say this design decision will certainly cause subtle bugs, such as
the one we are discussing in this thread.  I suggest to modify the
design to not use encoded strings internally.




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  6:38               ` Eli Zaretskii
@ 2006-10-20  8:59                 ` Katsumi Yamaoka
  2006-10-21  2:03                   ` Richard Stallman
  2006-10-20 19:19                 ` Stefan Monnier
  2006-10-21  1:01                 ` Kenichi Handa
  2 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-20  8:59 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

>>>>> In <u3b9jqyod.fsf@gnu.org> Eli Zaretskii wrote:
>> Date: Fri, 20 Oct 2006 15:21:53 +0900
>> From: Katsumi Yamaoka <yamaoka@jpl.org>
>> Cc: emacs-pretest-bug@gnu.org, id.brep@gmail.com, ding@gnus.org
>>
>> IIRC, nntp servers understand utf-8 encoded group names.  So,
>> someone might have considered making Gnus use them internally is
>> convenient to communicate with nntp servers.

> I'd say this design decision will certainly cause subtle bugs, such as
> the one we are discussing in this thread.  I suggest to modify the
> design to not use encoded strings internally.

I hastened to change the nndoc code so as to use encoded group
names but I agree with you.  Though to implement it will take
efforts and a long time, I think it is a subject to have to be
solved in the future anyway.

BTW, I realized that I misunderstood Zhang Wei's case.  The
group name is encoded in gb2312, not utf-8, as Handa-san wrote.
That might be the default of the nntp server that Zhang Wei uses,
or the news administrator might have done something wrong.  If
it were utf-8, Gnus should work (in other words, there is currently
no way to enable Gnus to handle gb2312 encoded group names).
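
A quick way to check this reading, as a sketch: the string below is the
group name exactly as it appears in the `gnus-newsrc-alist' excerpt
quoted earlier in the thread, and `gb2312' is assumed to be available
as a coding system in the running Emacs:

```elisp
;; RAW is the group name as quoted from Zhang Wei's .newsrc.eld.
(let ((raw "\301\367\320\30799.\261\276\265\330\262\342\312\324"))
  (list (multibyte-string-p raw)             ; nil: these are raw bytes
        (decode-coding-string raw 'gb2312)   ; should be readable Chinese text
        (decode-coding-string raw 'utf-8)))  ; not the intended name
```

Since the default charset for group names is utf-8, the third form is
the one that corresponds to what ends up on the group line.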




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  6:38               ` Eli Zaretskii
  2006-10-20  8:59                 ` Katsumi Yamaoka
@ 2006-10-20 19:19                 ` Stefan Monnier
  2006-10-20 20:30                   ` Eli Zaretskii
  2006-10-21  1:01                 ` Kenichi Handa
  2 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-20 19:19 UTC (permalink / raw)
  Cc: Katsumi Yamaoka, emacs-pretest-bug, id.brep, ding

>> IIRC, nntp servers understand utf-8 encoded group names.  So,
>> someone might have considered making Gnus use them internally is
>> convenient to communicate with nntp servers.

> I'd say this design decision will certainly cause subtle bugs, such as
> the one we are discussing in this thread.  I suggest to modify the
> design to not use encoded strings internally.

It could be, although it would make sense to manipulate group names in
"encoded" form, in the sense of "not decoded".  I.e. keep the group names
obtained from the news server in their raw unibyte form, and only decode for
display purposes and only encode when the name comes from another place than
the server itself.  This way, Gnus should be able to (partly) work with
arbitrary encodings rather than mandating utf-8.  This may also help with
problems linked to utf-8 normalization (or lack thereof).
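
The idea could be sketched roughly as follows.  Both helpers are
hypothetical names invented for this illustration, and the coding
system is whatever the server is assumed to use (gb2312 here, purely
as an example):

```elisp
;; Hypothetical helpers; CODING is the encoding the server is assumed to use.
(defun my-group-display-name (raw coding)
  "Decode the unibyte group name RAW with CODING, for display only."
  (decode-coding-string raw coding))

(defun my-group-name-from-user (text coding)
  "Encode TEXT typed by the user into the raw form used as the group key."
  (encode-coding-string text coding))

(let* ((raw-from-server "\261\276\265\330")  ; bytes as received, kept as-is
       (shown (my-group-display-name raw-from-server 'gb2312))
       (key   (my-group-name-from-user shown 'gb2312)))
  ;; Round trip: a key built from user-visible text should equal the
  ;; raw name, provided the bytes are valid in CODING.
  (equal key raw-from-server))
```

The raw name stays the identity of the group throughout; the decoded
form exists only at the display boundary.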


        Stefan




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20 19:19                 ` Stefan Monnier
@ 2006-10-20 20:30                   ` Eli Zaretskii
  2006-10-20 22:06                     ` Stefan Monnier
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-20 20:30 UTC (permalink / raw)
  Cc: yamaoka, emacs-pretest-bug, id.brep, ding

> Cc: Katsumi Yamaoka <yamaoka@jpl.org>,  emacs-pretest-bug@gnu.org,
> 	  id.brep@gmail.com,  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 20 Oct 2006 15:19:43 -0400
> 
> > I'd say this design decision will certainly cause subtle bugs, such as
> > the one we are discussing in this thread.  I suggest to modify the
> > design to not use encoded strings internally.
> 
> It could be, although it would make sense to manipulate group names in
> "encoded" form, in the sense of "not decoded".

It could ``make sense'', but it's IMO a bad idea, since, as we both
know, Emacs is not well suited to handling unibyte strings.




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20 20:30                   ` Eli Zaretskii
@ 2006-10-20 22:06                     ` Stefan Monnier
  2006-10-21  9:22                       ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-20 22:06 UTC (permalink / raw)
  Cc: yamaoka, emacs-pretest-bug, id.brep, ding

>> > I'd say this design decision will certainly cause subtle bugs, such as
>> > the one we are discussing in this thread.  I suggest to modify the
>> > design to not use encoded strings internally.
>> 
>> It could be, although it would make sense to manipulate group names in
>> "encoded" form, in the sense of "not decoded".

> It could ``make sense'', but it's IMO a bad idea, since, as we both
> know, Emacs is not well suited to handling unibyte strings.

Huh?  Unibyte strings are perfectly well supported as far as I know.

You have to be careful to remember which strings are unibyte and which are
multibyte, so you don't decode multibyte strings or encode unibyte strings,
and especially not implicitly (by inserting a unibyte string in a multibyte
buffer or vice versa).  So if you mean that it requires discipline, then
I agree, but otherwise I don't know what you're referring to.


        Stefan




* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  6:38               ` Eli Zaretskii
  2006-10-20  8:59                 ` Katsumi Yamaoka
  2006-10-20 19:19                 ` Stefan Monnier
@ 2006-10-21  1:01                 ` Kenichi Handa
  2 siblings, 0 replies; 45+ messages in thread
From: Kenichi Handa @ 2006-10-21  1:01 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

In article <u3b9jqyod.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Date: Fri, 20 Oct 2006 15:21:53 +0900
> > From: Katsumi Yamaoka <yamaoka@jpl.org>
> > Cc: emacs-pretest-bug@gnu.org, id.brep@gmail.com, ding@gnus.org
> > 
> > IIRC, nntp servers understand utf-8 encoded group names.  So,
> > someone might have considered making Gnus use them internally is
> > convenient to communicate with nntp servers.

> I'd say this design decision will certainly cause subtle bugs, such as
> the one we are discussing in this thread.  I suggest to modify the
> design to not use encoded strings internally.

I agree.  Keeping around encoded strings quite easily leads
to bugs.  String/buffer should be encoded only just before
writing out.

---
Kenichi Handa
handa@m17n.org


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20  8:59                 ` Katsumi Yamaoka
@ 2006-10-21  2:03                   ` Richard Stallman
  2006-10-22 23:28                     ` Katsumi Yamaoka
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-21  2:03 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

    > I'd say this design decision will certainly cause subtle bugs, such as
    > the one we are discussing in this thread.  I suggest to modify the
    > design to not use encoded strings internally.

    I hastened to change the nndoc code so as to use encoded group
    names but I agree with you.  Though to implement it will take
    efforts and a long time, I think it is a subject to have to be
    solved in the future anyway.

I don't entirely understand that statement.
Are you about to fix this now, or do you think it should be
delayed?


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-20 22:06                     ` Stefan Monnier
@ 2006-10-21  9:22                       ` Eli Zaretskii
  2006-10-23  3:55                         ` Stefan Monnier
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-21  9:22 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

> Cc: yamaoka@jpl.org,  emacs-pretest-bug@gnu.org,  id.brep@gmail.com,
> 	  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 20 Oct 2006 18:06:09 -0400
> 
> >> It could be, although it would make sense to manipulate group names in
> >> "encoded" form, in the sense of "not decoded".
> 
> > It could ``make sense'', but it's IMO a bad idea, since, as we both
> > know, Emacs is not well suited to handling unibyte strings.
> 
> Huh?  Unibyte strings are perfectly well supported as far as I know.
> 
> You have to be careful to remember which strings are unibyte and which are
> multibyte, so you don't decode multibyte strings or encode unibyte strings,
> and especially not implicitly (by inserting a unibyte string in a multibyte
> buffer or vice versa).  So if you mean that it requires discipline, then
> I agree, but otherwise I don't know what you're referring to.

To me, the second paragraph is precisely the meaning of ``not well
suited'' and ``not perfectly supported''.  What kind of ``well
supported'' is that if I as a programmer need to carry with each
string additional information, and make sure I know _exactly_ what
primitives are invoked by every function I call, to take care that I
don't inadvertently call something that deep inside assumes I passed a
multibyte string?

That way lies madness.


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-21  2:03                   ` Richard Stallman
@ 2006-10-22 23:28                     ` Katsumi Yamaoka
  2006-10-23 11:45                       ` Richard Stallman
  0 siblings, 1 reply; 45+ messages in thread
From: Katsumi Yamaoka @ 2006-10-22 23:28 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

>>>>> In <E1Gb6Cw-0006wp-6I@fencepost.gnu.org> Richard Stallman wrote:

>> I'd say this design decision will certainly cause subtle bugs, such as
>> the one we are discussing in this thread.  I suggest to modify the
>> design to not use encoded strings internally.

>     I hastened to change the nndoc code so as to use encoded group
>     names but I agree with you.  Though to implement it will take
>     efforts and a long time, I think it is a subject to have to be
>     solved in the future anyway.

> I don't entirely understand that statement.
> Are you about to fix this now, or do you think it should be
> delayed?

I've already fixed the nndoc code in both the Gnus CVS trunk and
the v5-10 branch (it will be merged into the Emacs CVS soon).
Although I haven't yet changed the handling of non-ASCII group
names (that is, Gnus still represents them in the utf-8 encoded
style internally), it won't trouble users.

I agree with making Gnus encode non-ASCII group names only when
communicating with nntp servers, and I (or someone?) will try it
in the future.  I think it should be done in the Gnus trunk
first, and it will take time for coding, testing, and possibly
bug fixing.  So, importing it into Emacs will inevitably be
delayed.  At present, I don't know whether it will take
days, weeks or years.

Regards,


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-21  9:22                       ` Eli Zaretskii
@ 2006-10-23  3:55                         ` Stefan Monnier
  2006-10-23  4:16                           ` Eli Zaretskii
  2006-10-23 11:45                           ` Richard Stallman
  0 siblings, 2 replies; 45+ messages in thread
From: Stefan Monnier @ 2006-10-23  3:55 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

>> >> It could be, although it would make sense to manipulate group names in
>> >> "encoded" form, in the sense of "not decoded".
>> 
>> > It could ``make sense'', but it's IMO a bad idea, since, as we both
>> > know, Emacs is not well suited to handling unibyte strings.
>> 
>> Huh?  Unibyte strings are perfectly well supported as far as I know.
>> 
>> You have to be careful to remember which strings are unibyte and which are
>> multibyte, so you don't decode multibyte strings or encode unibyte strings,
>> and especially not implicitly (by inserting a unibyte string in a multibyte
>> buffer or vice versa).  So if you mean that it requires discipline, then
>> I agree, but otherwise I don't know what you're referring to.

> To me, the second paragraph is precisely the meaning of ``not well
> suited'' and ``not perfectly supported''.  What kind of ``well
> supported'' is that if I as a programmer need to carry with each
> string additional information, and make sure I know _exactly_ what
> primitives are invoked by every function I call, to take care that I
> don't inadvertently call something that deep inside assumes I passed a
> multibyte string?

> That way lies madness.

Agreed, but note that this problem is as much on the unibyte side as it is
on the multibyte side, so that seems to imply that you also think that Emacs
is not well suited to handling multibyte strings.

This said, I agree that Emacs should help more.  E.g. by signalling an error
when trying to insert multibyte text into a unibyte buffer.


        Stefan


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23  3:55                         ` Stefan Monnier
@ 2006-10-23  4:16                           ` Eli Zaretskii
  2006-10-23 19:11                             ` Stefan Monnier
  2006-10-23 11:45                           ` Richard Stallman
  1 sibling, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-23  4:16 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

> Cc: yamaoka@jpl.org,  emacs-pretest-bug@gnu.org,  id.brep@gmail.com,
> 	  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 22 Oct 2006 23:55:29 -0400
> 
> Agreed, but note that this problem is as much on the unibyte side as it is
> on the multibyte side

Not if I never let unibyte strings into my buffers and strings (modulo
bugs, of course).


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-22 23:28                     ` Katsumi Yamaoka
@ 2006-10-23 11:45                       ` Richard Stallman
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Stallman @ 2006-10-23 11:45 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding

    I agree with making Gnus encode non-ASCII group names only when
    communicating with nntp servers, and I (or someone?) will try it
    in the future.  I think it should be done in the Gnus trunk
    first, and it will take time for coding, testing, and possibly
    bug fixing.

If the existing code works for the users, I'd prefer that we not
install a further redesign before the Emacs 22 release.

Thanks.


* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23  3:55                         ` Stefan Monnier
  2006-10-23  4:16                           ` Eli Zaretskii
@ 2006-10-23 11:45                           ` Richard Stallman
  2006-10-23 19:16                             ` Stefan Monnier
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-23 11:45 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    This said, I agree that Emacs should help more.  E.g. by signalling an error
    when trying to insert multibyte text into a unibyte buffer.

This operation converts the string to unibyte.  It works correctly,
provided the characters in that string can be expressed in the unibyte
buffer.

If people generally agree it would be better to signal an error,
we could do that.  However, that would cause trouble trying to use
M-y to move past multibyte entries in the kill ring to reach the
unibyte entry you really want.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23  4:16                           ` Eli Zaretskii
@ 2006-10-23 19:11                             ` Stefan Monnier
  2006-10-23 20:06                               ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-23 19:11 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

>> Agreed, but note that this problem is as much on the unibyte side as it is
>> on the multibyte side

> Not if I never let unibyte strings into my buffers and strings (modulo
> bugs, of course).

I don't follow.  Not that it matters.

My point was simply if you stay 100% within multibyte, it all works, and if
you stay 100% in unibyte it all works, and it's only when you mix the two
that things don't work.  So the problem is neither with unibyte nor with
multibyte but with their interaction: the problem takes its root in the
conflation of the concept of byte and the concept of char.


        Stefan
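
To make the byte/character distinction concrete, here is a minimal sketch.
It is an illustration only, not code from Gnus, and it assumes a multibyte
Emacs in which the utf-8 coding system is available:

```elisp
;; Illustrative only: the same text seen as characters and as bytes.
(let* ((decoded "好")                                   ; one character
       (encoded (encode-coding-string decoded 'utf-8))) ; its UTF-8 bytes
  (list (multibyte-string-p decoded)   ; decoded text is a multibyte string
        (length decoded)               ; length counted in characters
        (multibyte-string-p encoded)   ; encoded text is a unibyte string
        (length encoded)))             ; length counted in bytes
;; => (t 1 nil 3)
```

Both values are "strings" to Lisp code, which is why passing one where the
other is expected goes unnoticed until the text is displayed or saved.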

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23 11:45                           ` Richard Stallman
@ 2006-10-23 19:16                             ` Stefan Monnier
  2006-10-24 17:43                               ` Richard Stallman
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-23 19:16 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

>     This said, I agree that Emacs should help more.  E.g. by signalling an
>     error when trying to insert multibyte text into a unibyte buffer.

> This operation converts the string to unibyte.

Indeed.  Using a default (and poorly specified) encoding method.

> It works correctly, provided the characters in that string can be
> expressed in the unibyte buffer.

But which characters can be expressed is poorly specified.  E.g. Tell me
which chars can be expressed in a unibyte buffer in a BIG5 locale?

> If people generally agree it would be better to signal an error,
> we could do that.  However, that would cause trouble trying to use
> M-y to move past multibyte entries in the kill ring to reach the
> unibyte entry you really want.

When the insertion is a user-level operation, the elisp code should make
sure to manually do the encoding/decoding, using e.g. the default file
coding-system.


        Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23 19:11                             ` Stefan Monnier
@ 2006-10-23 20:06                               ` Eli Zaretskii
  2006-10-23 20:49                                 ` Stefan Monnier
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-23 20:06 UTC (permalink / raw)
  Cc: yamaoka, emacs-pretest-bug, id.brep, ding

> Cc: yamaoka@jpl.org,  emacs-pretest-bug@gnu.org,  id.brep@gmail.com,
> 	  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 23 Oct 2006 15:11:09 -0400
> 
> My point was simply if you stay 100% within multibyte, it all works, and if
> you stay 100% in unibyte it all works

The former is true, the latter isn't, AFAIK.  ``Normal'' Emacs
primitives and subroutines always do TRT with multibyte strings, while
with unibyte you need to be careful which ones you call.  That was my
point, and the case that started this thread is my evidence.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23 20:06                               ` Eli Zaretskii
@ 2006-10-23 20:49                                 ` Stefan Monnier
  2006-10-24  4:17                                   ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-23 20:49 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

>> My point was simply if you stay 100% within multibyte, it all works, and if
>> you stay 100% in unibyte it all works

> The former is true, the latter isn't, AFAIK.  ``Normal'' Emacs
> primitives and subroutines always do TRT with multibyte strings, while
> with unibyte you need to be careful which ones you call.

Care to give an example of what you're thinking about, where purely unibyte
strings and buffers are not properly handled?
After all, such cases are probably bugs.

> That was my point, and the case that started this thread is my evidence.

I must have misunderstood because from what I read in this thread I thought
the problem was due to the fact that one part of the code is using unibyte
strings (for group names) and it's apparently messed up somewhere because it
gets mixed with multibyte data.

Sorry I misunderstood and went on with a rant.


        Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23 20:49                                 ` Stefan Monnier
@ 2006-10-24  4:17                                   ` Eli Zaretskii
  2006-10-24 15:22                                     ` Stefan Monnier
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-24  4:17 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

> Cc: yamaoka@jpl.org,  emacs-pretest-bug@gnu.org,  id.brep@gmail.com,
> 	  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 23 Oct 2006 16:49:59 -0400
> 
> >> My point was simply if you stay 100% within multibyte, it all works, and if
> >> you stay 100% in unibyte it all works
> 
> > The former is true, the latter isn't, AFAIK.  ``Normal'' Emacs
> > primitives and subroutines always do TRT with multibyte strings, while
> > with unibyte you need to be careful which ones you call.
> 
> Care to give an example of what you're thinking about, where purely unibyte
> strings and buffers are not properly handled?

Are you talking about a unibyte Emacs session?  If so, that's not what
I had in mind.  I'm talking about using unibyte strings in a multibyte
session.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24  4:17                                   ` Eli Zaretskii
@ 2006-10-24 15:22                                     ` Stefan Monnier
  2006-10-24 17:27                                       ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-24 15:22 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

>> >> My point was simply if you stay 100% within multibyte, it all works,
>> >> and if you stay 100% in unibyte it all works
>> 
>> > The former is true, the latter isn't, AFAIK.  ``Normal'' Emacs
>> > primitives and subroutines always do TRT with multibyte strings, while
>> > with unibyte you need to be careful which ones you call.
>> 
>> Care to give an example of what you're thinking about, where purely unibyte
>> strings and buffers are not properly handled?

> Are you talking about a unibyte Emacs session?  If so, that's not what
> I had in mind.  I'm talking about using unibyte strings in a multibyte
> session.

I'm not quite sure what is a "unibyte session", but I think "stay 100% in
unibyte" is fairly clear: only use unibyte buffers and strings in the
relevant code (while other unrelated buffers and strings may be multibyte).
So I think we're thinking about the same situation.


        Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 15:22                                     ` Stefan Monnier
@ 2006-10-24 17:27                                       ` Eli Zaretskii
  2006-10-24 18:03                                         ` Stefan Monnier
  2006-10-25 18:02                                         ` Richard Stallman
  0 siblings, 2 replies; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-24 17:27 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

> Cc: yamaoka@jpl.org,  emacs-pretest-bug@gnu.org,  id.brep@gmail.com,
> 	  ding@gnus.org
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Tue, 24 Oct 2006 11:22:51 -0400
> 
> I'm not quite sure what is a "unibyte session"

A.k.a. "emacs --unibyte".

> but I think "stay 100% in
> unibyte" is fairly clear: only use unibyte buffers and strings in the
> relevant code (while other unrelated buffers and strings may be multibyte).

I think it's practically impossible to use only unibyte buffers for
any serious work, and therefore I don't consider this a feasible
solution.

If one uses the default multibyte session, using unibyte strings is
prone to subtle problems as described in this thread.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-23 19:16                             ` Stefan Monnier
@ 2006-10-24 17:43                               ` Richard Stallman
  2006-10-24 18:14                                 ` Stefan Monnier
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-24 17:43 UTC (permalink / raw)
  Cc: eliz, emacs-pretest-bug, yamaoka, id.brep, ding

    > It works correctly, provided the characters in that string can be
    > expressed in the unibyte buffer.

    But which characters can be expressed is poorly specified.  E.g. Tell me
    which chars can be expressed in a unibyte buffer in a BIG5 locale?

Mentioning the locale is somewhat of a red herring, since what controls
this conversion is (effectively) nonascii-insert-offset.

Mentioning BIG5 is a second red herring.  You can't represent Chinese
in 8-bit characters, but that is not Emacs' fault.

Do you think that we need to document nonascii-insert-offset more
prominently?  If so, where else should we talk about it?

    > If people generally agree it would be better to signal an error,
    > we could do that.  However, that would cause trouble trying to use
    > M-y to move past multibyte entries in the kill ring to reach the
    > unibyte entry you really want.

    When the insertion is a user-level operation, the elisp code should make
    sure to manually do the encoding/decoding, using e.g. the default file
    coding-system.

I don't understand -- could you be more specific?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 17:27                                       ` Eli Zaretskii
@ 2006-10-24 18:03                                         ` Stefan Monnier
  2006-10-25 18:02                                         ` Richard Stallman
  1 sibling, 0 replies; 45+ messages in thread
From: Stefan Monnier @ 2006-10-24 18:03 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

>> I'm not quite sure what is a "unibyte session"
> A.k.a. "emacs --unibyte".

I know that, but I'm not quite sure what it entails.  This discussion is
within the scope of code such as Gnus's, i.e. code which should work
either way.

>> but I think "stay 100% in
>> unibyte" is fairly clear: only use unibyte buffers and strings in the
>> relevant code (while other unrelated buffers and strings may be multibyte).

> I think it's practically impossible to use only unibyte buffers for
> any serious work, and therefore I don't consider this a feasible
> solution.

The operative term there is "in the relevant code".  E.g. Gnus could easily
(as opposed to "practically impossible") use unibyte for all its buffers
and strings.  It's also very common (and often necessary) to use unibyte
buffers and strings to interact with underlying processes or network
connections, typically because the data passed back and forth may use mixes of
various encodings.

> If one uses the default multibyte session, using unibyte strings is
> prone to subtle problems as described in this thread.

But those problems are not specific to unibyte, but to the mix of unibyte
and multibyte.  In most packages such as Gnus it's just as hard/impossible
to use only multibyte as it is to use only unibyte.


        Stefan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 17:43                               ` Richard Stallman
@ 2006-10-24 18:14                                 ` Stefan Monnier
  2006-10-25 18:03                                   ` Richard Stallman
  2006-10-25 18:03                                   ` Richard Stallman
  0 siblings, 2 replies; 45+ messages in thread
From: Stefan Monnier @ 2006-10-24 18:14 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

>> It works correctly, provided the characters in that string can be
>> expressed in the unibyte buffer.

>     But which characters can be expressed is poorly specified.  E.g. Tell me
>     which chars can be expressed in a unibyte buffer in a BIG5 locale?

> Mentioning the locale is somewhat of a red herring, since what controls
> this conversion is (effectively) nonascii-insert-offset.

The nonascii-insert-offset and nonascii-translation-table are AFAIK
initialized differently depending on the locale (and/or language
environment) and users typically don't fiddle with that table directly but
via their locale setting instead.

> Mentioning BIG5 is a second red herring.  You can't represent Chinese
> in 8-bit characters, but that is not Emacs' fault.

Code which implicitly converts text from multibyte to unibyte (and vice
versa), using nonascii-*, will presumably be used in all kinds of locales,
including BIG5 ones.  So knowing what happens in this case is
still relevant.

> Do you think that we need to document nonascii-insert-offset more
> prominently?  If so, where else should we talk about it?

No, I think we should kill it instead and declare in error any code which
tries to use it.  It made sense in Emacs-20 when the multibyte support was
weaker, but nowadays it just encourages sloppy code which breaks down in
different language environments.

>> If people generally agree it would be better to signal an error,
>> we could do that.  However, that would cause trouble trying to use
>> M-y to move past multibyte entries in the kill ring to reach the
>> unibyte entry you really want.

>     When the insertion is a user-level operation, the elisp code should make
>     sure to manually do the encoding/decoding, using e.g. the default file
>     coding-system.

> I don't understand -- could you be more specific?

C-y/M-y uses `insert' somewhere internally.  My suggestion is to make
`insert' signal an error when faced with the need to insert a multibyte
string in a unibyte buffer.  This doesn't mean that C-y/M-y should propagate
this error.


        Stefan
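
The suggestion above could be sketched roughly as follows.  All names
starting with `my-' are hypothetical, the real `insert' does not behave
this way, and falling back to the default file coding system (or utf-8) is
an assumption made only for the sake of the example:

```elisp
;; Hypothetical sketch; not how `insert' or `yank' are implemented.
(define-error 'my-multibyte-in-unibyte
  "Multibyte text inserted into unibyte buffer")

(defun my-strict-insert (string)
  "Insert STRING, but signal an error instead of converting implicitly.
The error is raised when the current buffer is unibyte and STRING is a
multibyte string containing non-ASCII characters."
  (if (and (not enable-multibyte-characters)
           (multibyte-string-p string)
           (string-match-p "[^[:ascii:]]" string))
      (signal 'my-multibyte-in-unibyte (list (buffer-name)))
    (insert string)))

(defun my-user-level-insert (string)
  "Insert STRING on behalf of the user without propagating the error.
If the strict insertion fails, encode STRING explicitly first."
  (condition-case nil
      (my-strict-insert string)
    (my-multibyte-in-unibyte
     ;; Assumption: the default file coding system is a reasonable choice.
     (insert (encode-coding-string
              string
              (or (default-value 'buffer-file-coding-system) 'utf-8))))))
```

The low-level function refuses to guess, while the user-level command
decides explicitly which coding system to apply.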

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 17:27                                       ` Eli Zaretskii
  2006-10-24 18:03                                         ` Stefan Monnier
@ 2006-10-25 18:02                                         ` Richard Stallman
  2006-10-25 20:22                                           ` Eli Zaretskii
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-25 18:02 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    If one uses the default multibyte session, using unibyte strings is
    prone to subtle problems as described in this thread.

I was not following the thread.  Could you explain the problem
that was encountered?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 18:14                                 ` Stefan Monnier
@ 2006-10-25 18:03                                   ` Richard Stallman
  2006-10-25 18:03                                   ` Richard Stallman
  1 sibling, 0 replies; 45+ messages in thread
From: Richard Stallman @ 2006-10-25 18:03 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    Code which implicitly converts text from multibyte to unibyte (and vice
    versa), using nonascii-*, will presumably be used in all kinds of locales,
    including BIG5 ones.  So knowing what happens in this case is
    still relevant.

It is not hard to know what happens--that is documented in the Lisp
Manual.  (Do you think any of it is not clear?)

Meanwhile, I think that the presumption of the above text is incorrect.
Unibyte text can only handle certain European alphabets.  If you use
unibyte text, you should make sure to use it only for them.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-24 18:14                                 ` Stefan Monnier
  2006-10-25 18:03                                   ` Richard Stallman
@ 2006-10-25 18:03                                   ` Richard Stallman
  2006-10-27  2:48                                     ` Kenichi Handa
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-25 18:03 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    C-y/M-y uses `insert' somewhere internally.  My suggestion is to make
    `insert' signal an error when faced with the need to insert a multibyte
    string in a unibyte buffer.  This doesn't mean that C-y/M-y should propagate
    this error.

That might work.  We could try it, after the release.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-25 18:02                                         ` Richard Stallman
@ 2006-10-25 20:22                                           ` Eli Zaretskii
  2006-10-26  8:52                                             ` Richard Stallman
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-25 20:22 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

> From: Richard Stallman <rms@gnu.org>
> CC: monnier@iro.umontreal.ca, emacs-pretest-bug@gnu.org,
> 	yamaoka@jpl.org, id.brep@gmail.com, ding@gnus.org
> Date: Wed, 25 Oct 2006 14:02:00 -0400
> 
>     If one uses the default multibyte session, using unibyte strings is
>     prone to subtle problems as described in this thread.
> 
> I was not following the thread.  Could you explain the problem
> that was encountered?

Gnus stored a name of a news group in encoded form.  Manipulating that
encoded name as a normal Emacs string caused some weird problem (I no
longer remember the details, but I'm hardly surprised that it happened).

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-25 20:22                                           ` Eli Zaretskii
@ 2006-10-26  8:52                                             ` Richard Stallman
  2006-10-27  8:05                                               ` Eli Zaretskii
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Stallman @ 2006-10-26  8:52 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    Gnus stored a name of a news group in encoded form.

There is a big difference between unibyte strings and encoded unibyte
strings.  The latter indeed requires a lot of special care.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-25 18:03                                   ` Richard Stallman
@ 2006-10-27  2:48                                     ` Kenichi Handa
  0 siblings, 0 replies; 45+ messages in thread
From: Kenichi Handa @ 2006-10-27  2:48 UTC (permalink / raw)
  Cc: emacs-pretest-bug, id.brep, ding, yamaoka

In article <E1Gcn5p-0006bW-Uy@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

>     C-y/M-y uses `insert' somewhere internally.  My suggestion is to make
>     `insert' signal an error when faced with the need to insert a multibyte
>     string in a unibyte buffer.  This doesn't mean that C-y/M-y should propagate
>     this error.

> That might work.  We could try it, after the release.

Stefan, how about starting to try it in emacs-unicode-2 now?  I
generally agree with your view of the unibyte<->multibyte
problem.  You also proposed changing the current automatic
unibyte->multibyte conversion from the string-make-multibyte
method to the string-to-multibyte method a while ago, didn't
you?  I think that change is good too.

---
Kenichi Handa
handa@m17n.org
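
As an illustration of the string-to-multibyte behaviour mentioned above,
here is a small sketch.  It assumes an Emacs that provides
`string-to-multibyte' and `string-to-unibyte', and it deliberately shows
only that one function, since the result of string-make-multibyte depends
on the Emacs version and settings:

```elisp
;; Illustrative only: `string-to-multibyte' keeps every byte intact.
(let* ((bytes "\345\245\275")               ; UTF-8 bytes of 好, unibyte
       (wide  (string-to-multibyte bytes))) ; same bytes as raw-byte chars
  (list (multibyte-string-p wide)                ; now a multibyte string
        (length wide)                            ; still three elements
        (string= (string-to-unibyte wide) bytes) ; round trip is lossless
        (decode-coding-string bytes 'utf-8)))    ; explicit decoding
;; => (t 3 t "好")
```

The bytes are carried over unchanged rather than being reinterpreted as
characters of some charset, so a later explicit decode can still recover
the text.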

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-26  8:52                                             ` Richard Stallman
@ 2006-10-27  8:05                                               ` Eli Zaretskii
  2006-10-27 13:33                                                 ` Richard Stallman
  0 siblings, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-27  8:05 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

> From: Richard Stallman <rms@gnu.org>
> CC: monnier@iro.umontreal.ca, emacs-pretest-bug@gnu.org,
> 	yamaoka@jpl.org, id.brep@gmail.com, ding@gnus.org
> Date: Thu, 26 Oct 2006 04:52:56 -0400
> 
> There is a big difference between unibyte strings and encoded unibyte
> strings.

What is that difference?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-27  8:05                                               ` Eli Zaretskii
@ 2006-10-27 13:33                                                 ` Richard Stallman
  2006-10-27 14:27                                                   ` Stefan Monnier
  2006-10-28 10:28                                                   ` Eli Zaretskii
  0 siblings, 2 replies; 45+ messages in thread
From: Richard Stallman @ 2006-10-27 13:33 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    > There is a big difference between unibyte strings and encoded unibyte
    > strings.

    What is that difference?

You can represent one of Emacs' supported Latin alphabets in
(unencoded) unibyte strings, and Emacs will automatically convert to
and from multibyte.

However, if you store encoded text in unibyte strings, you are
responsible for decoding and encoding when necessary.  You have to
keep track, everywhere, of whether the data is encoded or not.

We implemented the ability to do encoding manually because sometimes
it is necessary to decode parts of a file in different ways (e.g.,
mailboxes).

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-27 13:33                                                 ` Richard Stallman
@ 2006-10-27 14:27                                                   ` Stefan Monnier
  2006-10-28 18:13                                                     ` Richard Stallman
  2006-10-28 10:28                                                   ` Eli Zaretskii
  1 sibling, 1 reply; 45+ messages in thread
From: Stefan Monnier @ 2006-10-27 14:27 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

> You can represent one of Emacs' supported Latin alphabets in
> (unencoded) unibyte strings, and Emacs will automatically convert to
> and from multibyte.

And this use was very convenient for Emacs-20 where we wanted to keep some
backward compatibility with code that was not MULE-aware.

But nowadays any code which relies on this is simply broken, AFAIC, because
it'll only work in environments using an iso-8859 encoding (more or less) and
will thus be unusable in Asian environments or in utf-8 (which is very
quickly taking over the iso-8859 world).

> However, if you store encoded text in unibyte strings, you are
> responsible for decoding and encoding when necessary.  You have to
> keep track, everywhere, of whether the data is encoded or not.

It's pretty easy to keep track of it: unibyte == encoded, multibyte
== decoded.


        Stefan
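
A minimal sketch of that convention follows.  The function names are
hypothetical (they are not Gnus functions), utf-8 as the default coding
system is an assumption, and Emacs itself does not enforce the convention:

```elisp
;; Hypothetical helpers for the "unibyte == encoded" convention.
(defun my-ensure-decoded (string &optional coding)
  "Return STRING as decoded text.
A unibyte STRING is taken to be encoded in CODING (default utf-8);
a multibyte STRING is taken to be decoded already."
  (if (multibyte-string-p string)
      string
    (decode-coding-string string (or coding 'utf-8))))

(defun my-ensure-encoded (string &optional coding)
  "Return STRING as encoded bytes.
A multibyte STRING is encoded with CODING (default utf-8);
a unibyte STRING is taken to be encoded already."
  (if (multibyte-string-p string)
      (encode-coding-string string (or coding 'utf-8))
    string))
```

For example, (my-ensure-decoded (my-ensure-encoded "好")) gives back "好".
Pure ASCII strings may be unibyte in either state, which is harmless here
because encoding or decoding them with utf-8 leaves them unchanged.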

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-27 13:33                                                 ` Richard Stallman
  2006-10-27 14:27                                                   ` Stefan Monnier
@ 2006-10-28 10:28                                                   ` Eli Zaretskii
  2006-10-29 18:45                                                     ` Richard Stallman
  1 sibling, 1 reply; 45+ messages in thread
From: Eli Zaretskii @ 2006-10-28 10:28 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding

> From: Richard Stallman <rms@gnu.org>
> CC: monnier@iro.umontreal.ca, emacs-pretest-bug@gnu.org,
> 	yamaoka@jpl.org, id.brep@gmail.com, ding@gnus.org
> Date: Fri, 27 Oct 2006 09:33:35 -0400
> 
>     > There is a big difference between unibyte strings and encoded unibyte
>     > strings.
> 
>     What is that difference?
> 
> You can represent one of Emacs' supported Latin alphabets in
> (unencoded) unibyte strings, and Emacs will automatically convert to
> and from multibyte.

AFAIK, Latin-N unibyte strings and iso-8859-N text encoded in Latin-N
use the same numerical codes for the same characters, so they are
indistinguishable.

Handa-san, am I right?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-27 14:27                                                   ` Stefan Monnier
@ 2006-10-28 18:13                                                     ` Richard Stallman
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Stallman @ 2006-10-28 18:13 UTC (permalink / raw)
  Cc: id.brep, emacs-pretest-bug, yamaoka, ding

    > However, if you store encoded text in unibyte strings, you are
    > responsible for decoding and encoding when necessary.  You have to
    > keep track, everywhere, of whether the data is encoded or not.

    It's pretty easy to keep track of it: unibyte == encoded, multibyte
    == decoded.

What you're proposing is a convention which a certain program could
use internally.  It might be a workable convention for some purposes.
But it is not automatic, and not required by Emacs.

    > You can represent one of Emacs' supported Latin alphabets in
    > (unencoded) unibyte strings, and Emacs will automatically convert to
    > and from multibyte.

    And this use was very convenient for Emacs-20 where we wanted to keep some
    backward compatibility with code that was not MULE-aware.

    But nowadays any code which relies on this is simply broken, AFAIC, because
    it'll only work in environments using an iso-8859 encoding (more or less)

I think you're mistaken.  The conversion between unibyte and multibyte
involves internal Emacs characters.  It concerns character sets, not
coding systems.

However, it is true that the use of unibyte strings is only applicable
to alphabets such as could be represented in unibyte strings.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: `.newsrc.eld' saves chinese group name in wrong coding
  2006-10-28 10:28                                                   ` Eli Zaretskii
@ 2006-10-29 18:45                                                     ` Richard Stallman
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Stallman @ 2006-10-29 18:45 UTC (permalink / raw)
  Cc: emacs-pretest-bug, yamaoka, id.brep, ding, handa

    > You can represent one of Emacs' supported Latin alphabets in
    > (unencoded) unibyte strings, and Emacs will automatically convert to
    > and from multibyte.

    AFAIK, Latin-N unibyte strings and iso-8859-N text encoded in Latin-N
    use the same numerical codes for the same characters, so they are
    indistinguishable.

I think that is true, but if that's what you're doing, you'll
understand it better if you think "unibyte representations of these Emacs
characters" rather than "encoded in a coding system".

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2006-10-29 18:45 UTC | newest]

Thread overview: 45+ messages
     [not found] <ufydnay6j.fsf@gmail.com>
2006-10-19  2:54 ` `.newsrc.eld' saves chinese group name in wrong coding Chong Yidong
2006-10-19  3:56   ` Katsumi Yamaoka
2006-10-19  4:11     ` Katsumi Yamaoka
2006-10-19  8:33       ` Reiner Steib
2006-10-19  9:03         ` Katsumi Yamaoka
2006-10-20  3:39           ` Chong Yidong
2006-10-20  4:06             ` Katsumi Yamaoka
2006-10-20  5:18               ` Katsumi Yamaoka
2006-10-20  6:04           ` Eli Zaretskii
2006-10-20  6:21             ` Katsumi Yamaoka
2006-10-20  6:38               ` Eli Zaretskii
2006-10-20  8:59                 ` Katsumi Yamaoka
2006-10-21  2:03                   ` Richard Stallman
2006-10-22 23:28                     ` Katsumi Yamaoka
2006-10-23 11:45                       ` Richard Stallman
2006-10-20 19:19                 ` Stefan Monnier
2006-10-20 20:30                   ` Eli Zaretskii
2006-10-20 22:06                     ` Stefan Monnier
2006-10-21  9:22                       ` Eli Zaretskii
2006-10-23  3:55                         ` Stefan Monnier
2006-10-23  4:16                           ` Eli Zaretskii
2006-10-23 19:11                             ` Stefan Monnier
2006-10-23 20:06                               ` Eli Zaretskii
2006-10-23 20:49                                 ` Stefan Monnier
2006-10-24  4:17                                   ` Eli Zaretskii
2006-10-24 15:22                                     ` Stefan Monnier
2006-10-24 17:27                                       ` Eli Zaretskii
2006-10-24 18:03                                         ` Stefan Monnier
2006-10-25 18:02                                         ` Richard Stallman
2006-10-25 20:22                                           ` Eli Zaretskii
2006-10-26  8:52                                             ` Richard Stallman
2006-10-27  8:05                                               ` Eli Zaretskii
2006-10-27 13:33                                                 ` Richard Stallman
2006-10-27 14:27                                                   ` Stefan Monnier
2006-10-28 18:13                                                     ` Richard Stallman
2006-10-28 10:28                                                   ` Eli Zaretskii
2006-10-29 18:45                                                     ` Richard Stallman
2006-10-23 11:45                           ` Richard Stallman
2006-10-23 19:16                             ` Stefan Monnier
2006-10-24 17:43                               ` Richard Stallman
2006-10-24 18:14                                 ` Stefan Monnier
2006-10-25 18:03                                   ` Richard Stallman
2006-10-25 18:03                                   ` Richard Stallman
2006-10-27  2:48                                     ` Kenichi Handa
2006-10-21  1:01                 ` Kenichi Handa
