Gnus development mailing list
* [Unicode-2] `read' always returns multibyte symbol
@ 2007-11-13  9:41 Katsumi Yamaoka
  2007-11-13 12:55 ` Kenichi Handa
  2007-11-13 15:07 ` Stefan Monnier
  0 siblings, 2 replies; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-13  9:41 UTC (permalink / raw)
  To: emacs-devel; +Cc: ding

Hi,

The following Lisp snippet emulates what Gnus does when reading
active data for the local.テスト newsgroup.  The buffer contains
data that has been retrieved from the nntp server.  Note that
the newsgroup name contains non-ASCII characters, which the
server has encoded in utf-8.

--8<---------------cut here---------------start------------->8---
(let ((string (encode-coding-string "local.テスト" 'utf-8)))
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert (string-to-multibyte string))
    (goto-char (point-min))
    (multibyte-string-p (symbol-name (read (current-buffer))))))
--8<---------------cut here---------------end--------------->8---

While Emacs trunk returns nil for this, Emacs Unicode-2 returns t.

If it is not intentional, I hope `read' behaves just like it does
in Emacs trunk.  Otherwise, is there a way to make `read' return
a unibyte symbol (without slowing down)?

Inside Gnus, non-ASCII group names are all treated as unibyte
strings, i.e. the byte sequences that the server has encoded
with some coding system.  Because of the present behavior of
`read' in Emacs Unicode-2, Gnus doesn't work properly with such
newsgroups.  You can find the actual code in gnus-start.el as
follows:

--8<---------------cut here---------------start------------->8---
;; Read an active file and place the results in `gnus-active-hashtb'.
(defun gnus-active-to-gnus-format (&optional method hashtb ignore-errors
					     real-active)
[...]
	      ;; group gets set to a symbol interned in the hash table
	      ;; (what a hack!!) - jwz
	      (setq group (let ((obarray hashtb)) (read cur)))
--8<---------------cut here---------------end--------------->8---

As you can see, it needs to work fast because there might be a
lot of newsgroups.  So, if possible, I don't want to modify it
into:

--8<---------------cut here---------------start------------->8---
 (setq group (intern (mm-string-as-unibyte (symbol-name (read cur))) hashtb))
--8<---------------cut here---------------end--------------->8---

Regards,



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-13  9:41 [Unicode-2] `read' always returns multibyte symbol Katsumi Yamaoka
@ 2007-11-13 12:55 ` Kenichi Handa
  2007-11-13 15:10   ` Stefan Monnier
  2007-11-14  3:56   ` Katsumi Yamaoka
  2007-11-13 15:07 ` Stefan Monnier
  1 sibling, 2 replies; 23+ messages in thread
From: Kenichi Handa @ 2007-11-13 12:55 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: ding, emacs-devel

In article <b4moddycwjv.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

> The following Lisp snippet emulates what Gnus does when reading
> active data for the local.テスト newsgroup.  The buffer contains
> data which have been retrieved from the nntp server.  Note that
> the newsgroup name contains non-ASCII characters, which has been
> encoded by utf-8 in the server.

> --8<---------------cut here---------------start------------->8---
> (let ((string (encode-coding-string "local.テスト" 'utf-8)))
>   (with-temp-buffer
>     (set-buffer-multibyte t)
>     (insert (string-to-multibyte string))
>     (goto-char (point-min))
>     (multibyte-string-p (symbol-name (read (current-buffer))))))
> --8<---------------cut here---------------end--------------->8---

> While Emacs trunk returns nil for this, Emacs Unicode-2 returns t.

That is because `read' decides whether the name is unibyte or
multibyte by whether the name is a valid multibyte sequence
or not.  In the trunk, a utf-8 byte sequence is not a valid
multibyte sequence, but in emacs-unicode-2, it is valid.

> If it is not intentional, I hope `read' behaves just like it does
> in Emacs trunk.

The relevant code for `read' is very complicated and I want
to avoid touching it if there's another way.

In addition, I think it is the right thing that the above
code returns t; i.e. any symbol created by reading a
multibyte buffer should have a multibyte string name.  The
bug to fix is that the following code also returns t in
emacs-unicode-2.

< --8<---------------cut here---------------start------------->8---
< (let ((string (encode-coding-string "local.テスト" 'utf-8)))
<   (with-temp-buffer
<     (set-buffer-multibyte nil)
<     (insert string)
<     (goto-char (point-min))
<     (multibyte-string-p (symbol-name (read (current-buffer))))))
< --8<---------------cut here---------------end--------------->8---

> Otherwise, is there a way to make `read' return a unibyte
> symbol (without slowing down)?

The replacement for the above code is as simple as this:

(multibyte-string-p
 (symbol-name (intern (encode-coding-string "local.テスト" 'utf-8))))

But, hmmm, it seems that we can't use such code in Gnus...

> In the inside of Gnus, non-ASCII group names are all treated as
> unibyte strings, that are the ones that the server has encoded
> with certain coding systems.  Because of the present behavior of
> `read' in Emacs Unicode-2, Gnus doesn't work with such newsgroups
> perfectly.  You can find the actual code in gnus-start.el as
> follows:

> --8<---------------cut here---------------start------------->8---
> ;; Read an active file and place the results in `gnus-active-hashtb'.
> (defun gnus-active-to-gnus-format (&optional method hashtb ignore-errors
> 					     real-active)
> [...]
> 	      ;; group gets set to a symbol interned in the hash table
> 	      ;; (what a hack!!) - jwz
> 	      (setq group (let ((obarray hashtb)) (read cur)))
> --8<---------------cut here---------------end--------------->8---

How about this?

(setq group
       (let ((obarray hashtb) pos)
	 (skip-syntax-forward "^w_")
	 (setq pos (point))
	 (skip-syntax-forward "w_")
	 (intern (buffer-substring pos (point)))))

I think the overhead is just several more function calls.  The
actual task (searching for a range of symbol constituents,
making a string from them, and interning it) is almost the same.
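
A minimal sketch of the same idea outside Gnus, assuming a recent
Emacs, a unibyte buffer, and a whitespace-delimited group name.  The
helper name `sketch-intern-group', the obarray `hashtb', and the
sample active line are all hypothetical.  Note also that the sketch
swaps in `skip-chars-forward' for the syntax classes above, because
"." has punctuation syntax in the standard syntax table and would
otherwise end the name early:

--8<---------------cut here---------------start------------->8---
;; Hypothetical helper: intern the first whitespace-delimited token of
;; LINE (a unibyte, already-encoded string) into the obarray HASHTB.
(defun sketch-intern-group (line hashtb)
  (with-temp-buffer
    (set-buffer-multibyte nil)        ; keep the encoded bytes as they are
    (insert line)
    (goto-char (point-min))
    (let ((start (point)))
      (skip-chars-forward "^ \t\n")   ; the group name ends at whitespace
      (intern (buffer-substring start (point)) hashtb))))

(let* ((hashtb (obarray-make 31))     ; stand-in for the Gnus hash table
       (line (encode-coding-string "local.テスト 10 1 y\n" 'utf-8))
       (group (sketch-intern-group line hashtb)))
  ;; Decode only for display; the symbol name itself stays encoded.
  (decode-coding-string (symbol-name group) 'utf-8))
--8<---------------cut here---------------end--------------->8---

Passing the obarray to `intern' explicitly avoids rebinding `obarray'
around the call, which is a design choice of the sketch rather than
something the Gnus code requires.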

---
Kenichi Handa
handa@ni.aist.go.jp

[-- Attachment #2: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-13  9:41 [Unicode-2] `read' always returns multibyte symbol Katsumi Yamaoka
  2007-11-13 12:55 ` Kenichi Handa
@ 2007-11-13 15:07 ` Stefan Monnier
  1 sibling, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2007-11-13 15:07 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: ding, emacs-devel

> --8<---------------cut here---------------start------------->8---
> (let ((string (encode-coding-string "local.テスト" 'utf-8)))
>   (with-temp-buffer
>     (set-buffer-multibyte t)
>     (insert (string-to-multibyte string))
>     (goto-char (point-min))
>     (multibyte-string-p (symbol-name (read (current-buffer))))))
> --8<---------------cut here---------------end--------------->8---

I'm not sure what Emacs should do in such a case, but in the example
above, using a multibyte buffer is asking for trouble.
Can't Gnus use a unibyte buffer in its corresponding code?  That would
speed things up, save you the use of string-to-multibyte, and make it
crystal clear that the result should be unibyte.


        Stefan "trying hard not to say that the use of a multibyte
                buffer here is a plain bug ;-)"

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-13 12:55 ` Kenichi Handa
@ 2007-11-13 15:10   ` Stefan Monnier
  2007-11-14  4:53     ` Kenichi Handa
  2007-11-14  3:56   ` Katsumi Yamaoka
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Monnier @ 2007-11-13 15:10 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Katsumi Yamaoka, ding, emacs-devel

> That is because `read' decides the name is unibyte or multibyte by
> whether the name is a valid multibyte sequence or not.

Yuck.

> The bug to fix is that the following code also returns t in
> emacs-unicode-2.

> < --8<---------------cut here---------------start------------->8---
> < (let ((string (encode-coding-string "local.テスト" 'utf-8)))
> <   (with-temp-buffer
> <     (set-buffer-multibyte nil)
> <     (insert string)
> <     (goto-char (point-min))
> <     (multibyte-string-p (symbol-name (read (current-buffer))))))
> < --8<---------------cut here---------------end--------------->8---

Yes, that's a clear bug.


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-13 12:55 ` Kenichi Handa
  2007-11-13 15:10   ` Stefan Monnier
@ 2007-11-14  3:56   ` Katsumi Yamaoka
  2007-11-14 11:39     ` Katsumi Yamaoka
  1 sibling, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-14  3:56 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ding, emacs-devel

>>>>> Kenichi Handa <handa@ni.aist.go.jp> wrote:

> In addition, I think it is the right thing that the above
> code return t; i.e. any symbol created by reading a
> multibyte buffer should have a multibyte string name.

I agree with that behavior.

> The bug to fix is that the following code also returns t in
> emacs-unicode-2.

> < --8<---------------cut here---------------start------------->8---
> < (let ((string (encode-coding-string "local.テスト" 'utf-8)))
> <   (with-temp-buffer
> <     (set-buffer-multibyte nil)
> <     (insert string)
> <     (goto-char (point-min))
> <     (multibyte-string-p (symbol-name (read (current-buffer))))))
> < --8<---------------cut here---------------end--------------->8---

Sure.  I'll try using a unibyte buffer to parse active data
(after the bug is fixed).

Regards,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-13 15:10   ` Stefan Monnier
@ 2007-11-14  4:53     ` Kenichi Handa
  0 siblings, 0 replies; 23+ messages in thread
From: Kenichi Handa @ 2007-11-14  4:53 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: yamaoka, ding, emacs-devel

In article <jwvoddykwt5.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > The bug to fix is that the following code also returns t in
> > emacs-unicode-2.

> > < --8<---------------cut here---------------start------------->8---
> > < (let ((string (encode-coding-string "local.テスト" 'utf-8)))
> > <   (with-temp-buffer
> > <     (set-buffer-multibyte nil)
> > <     (insert string)
> > <     (goto-char (point-min))
> > <     (multibyte-string-p (symbol-name (read (current-buffer))))))
> > < --8<---------------cut here---------------end--------------->8---

> Yes, that's a clear bug.

I've just installed a fix.

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-14  3:56   ` Katsumi Yamaoka
@ 2007-11-14 11:39     ` Katsumi Yamaoka
  2007-11-14 14:52       ` Stefan Monnier
  2007-11-15 10:20       ` Katsumi Yamaoka
  0 siblings, 2 replies; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-14 11:39 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ding, emacs-devel

>>>>> Katsumi Yamaoka wrote:

> I'll try using a unibyte buffer to parse active data (after the bug
> is fixed).

Handa-san, thank you for the fix in Unicode-2.  I've also made a
change in the Gnus CVS trunk so that it may use a unibyte buffer.
Now it works not only with Emacs 23.0.60 but also with Emacs 22.1,
22.1.50, and 23.0.50.

BTW, I found another problem with Emacs 21 (Gnus still supports
Emacs 21, IIUC).  So, I'll go on looking into it further.

The diff between Gnus trunk and Unicode-2 is here:

*** gnus-start.el~	Sun Nov 11 21:51:22 2007
--- gnus-start.el	Wed Nov 14 11:32:28 2007
***************
*** 2106,2112 ****
  			    (if (equal method gnus-select-method)
  				(gnus-make-hashtable
  				 (count-lines (point-min) (point-max)))
! 			      (gnus-make-hashtable 4096)))))))
      ;; Delete unnecessary lines.
      (goto-char (point-min))
      (cond
--- 2106,2113 ----
  			    (if (equal method gnus-select-method)
  				(gnus-make-hashtable
  				 (count-lines (point-min) (point-max)))
! 			      (gnus-make-hashtable 4096))))))
! 	group max min)
      ;; Delete unnecessary lines.
      (goto-char (point-min))
      (cond
***************
*** 2141,2148 ****
  		      (insert prefix)
  		      (zerop (forward-line 1)))))))
      ;; Store the active file in a hash table.
!     (goto-char (point-min))
!     (let (group max min)
        (while (not (eobp))
  	(condition-case ()
  	    (progn
--- 2142,2153 ----
  		      (insert prefix)
  		      (zerop (forward-line 1)))))))
      ;; Store the active file in a hash table.
!     ;; Use a unibyte buffer in order to make `read' read non-ASCII
!     ;; group names (which have been encoded) as unibyte strings.
!     (mm-with-unibyte-buffer
!       (insert-buffer-substring cur)
!       (setq cur (current-buffer))
!       (goto-char (point-min))
        (while (not (eobp))
  	(condition-case ()
  	    (progn

Regards,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-14 11:39     ` Katsumi Yamaoka
@ 2007-11-14 14:52       ` Stefan Monnier
  2007-11-14 23:52         ` Katsumi Yamaoka
  2007-11-15 10:20       ` Katsumi Yamaoka
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Monnier @ 2007-11-14 14:52 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: Kenichi Handa, ding, emacs-devel

> !     ;; Use a unibyte buffer in order to make `read' read non-ASCII
> !     ;; group names (which have been encoded) as unibyte strings.
> !     (mm-with-unibyte-buffer
> !       (insert-buffer-substring cur)

Why is `cur' a multibyte buffer?  Since it contains encoded strings, I'd
expect it would be better (more robust and convenient) to use a unibyte
buffer for it.


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-14 14:52       ` Stefan Monnier
@ 2007-11-14 23:52         ` Katsumi Yamaoka
  2007-11-15  1:15           ` Stefan Monnier
  0 siblings, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-14 23:52 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kenichi Handa, ding, emacs-devel

>>>>> Stefan Monnier wrote:

>> !     ;; Use a unibyte buffer in order to make `read' read non-ASCII
>> !     ;; group names (which have been encoded) as unibyte strings.
>> !     (mm-with-unibyte-buffer
>> !       (insert-buffer-substring cur)

> Why is `cur' a multibyte buffer?  Since it contains encoded strings, I'd
> expect it would be better (more robust and convenient) to use a unibyte
> buffer for it.

Good point.  The `cur' is `nntp-server-buffer' (" *nntpd*") or
`gnus-work-buffer' (" *gnus work*") as the case may be.  Gnus uses
those buffers for various purposes.  Although, as far as I can
observe, there seems to be no situation where they need to hold
multibyte data, Gnus explicitly sets them as multibyte buffers (see
`nnheader-init-server-buffer' and `gnus-set-work-buffer').  I
believe the reason is to avoid breaking data when copying it to
another multibyte buffer (IIUC, copying data from a multibyte
buffer to a unibyte buffer causes no problem).  So, I didn't
modify those buffers' multibyteness.  If I introduced a new unibyte
work buffer (such as " *gnus binary work*"), `gnus-read-active-file-2'
would have to bind `nntp-server-buffer' to it, for example.  That
buffer is used by all the back ends, and I'm not sure that would
never cause a problem with any of them.

Regards,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-14 23:52         ` Katsumi Yamaoka
@ 2007-11-15  1:15           ` Stefan Monnier
  2007-11-15  3:01             ` Katsumi Yamaoka
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Monnier @ 2007-11-15  1:15 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: Kenichi Handa, ding, emacs-devel

>>> !     ;; Use a unibyte buffer in order to make `read' read non-ASCII
>>> !     ;; group names (which have been encoded) as unibyte strings.
>>> !     (mm-with-unibyte-buffer
>>> !       (insert-buffer-substring cur)

>> Why is `cur' a multibyte buffer?  Since it contains encoded strings, I'd
>> expect it would be better (more robust and convenient) to use a unibyte
>> buffer for it.

> Good point.  The `cur' is `nntp-server-buffer' (" *nntpd*") or
> `gnus-work-buffer' (" *gnus work*") as the case may be.

Don't know about gnus-work-buffer, but nntp-server-buffer should only
ever contain unibyte data AFAICT, so it would be better to put it in
unibyte mode.

> Gnus uses those buffers for various purposes.  Although there looks no
> situation where it is necessary to have multibyte data as far as I can
> observe, Gnus explicitly sets them as multibyte buffers (see
> `nnheader-init-server-buffer' and `gnus-set-work-buffer').

> I believe the reason they do so is to prevent from breaking data when
> copying them to another multibyte buffer (IIUC, copying data from
> a multibyte buffer to a unibyte buffer causes no problem).

I'm not sure I understand: copying data from a multibyte buffer to
a unibyte buffer is exactly the case that can cause problems.


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15  1:15           ` Stefan Monnier
@ 2007-11-15  3:01             ` Katsumi Yamaoka
  2007-11-15  3:39               ` Stefan Monnier
  0 siblings, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-15  3:01 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kenichi Handa, ding, emacs-devel

>>>>> Stefan Monnier wrote:

> Don't know about gnus-work-buffer, but nntp-server-buffer should only
> ever contain unibyte data AFAICT, so it would be better to put it in
> unibyte mode.

I think it's better, too.  However, there might be code that
copies data from nntp-server-buffer to a multibyte buffer.  I'm
not able to check all of the Gnus code.

>> (IIUC, copying data from a multibyte buffer to a unibyte buffer
>> causes no problem).

> I'm not sure I understand: copying data from a multibyte buffer to
> a unibyte buffer is exactly the case that can cause problems.

I agree that's generally true.  But in Gnus' case, the data in a
multibyte work buffer is the multibyte version of binary data.
I don't know the proper words to explain it, sorry.  In other
words, it is what `string-to-multibyte' produces from binary
data.  For example:

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string-to-multibyte (encode-coding-string "日本語" 'utf-8)))
  (let ((buffer (current-buffer)))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (insert-buffer-substring buffer)
      (decode-coding-string (buffer-string) 'utf-8))))
 => "日本語"

I'm not sure it works with any data, though.
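
One way to probe that doubt, as a rough sketch assuming a recent
Emacs: push every possible byte value through the same multibyte
buffer -> unibyte buffer copy and compare the result with the input.
The helper name `sketch-bytes-survive-p' is hypothetical, and the
check only covers data that entered the multibyte buffer through
`string-to-multibyte':

--8<---------------cut here---------------start------------->8---
;; Hypothetical check: does BYTES (a unibyte string) survive the trip
;; unibyte string -> multibyte buffer -> unibyte buffer unchanged?
(defun sketch-bytes-survive-p (bytes)
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert (string-to-multibyte bytes))
    (let ((buffer (current-buffer)))
      (with-temp-buffer
        (set-buffer-multibyte nil)
        (insert-buffer-substring buffer)
        (equal (buffer-string) bytes)))))

;; Try every possible byte value once.
(sketch-bytes-survive-p (apply #'unibyte-string (number-sequence 0 255)))
--8<---------------cut here---------------end--------------->8---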

Regards,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15  3:01             ` Katsumi Yamaoka
@ 2007-11-15  3:39               ` Stefan Monnier
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2007-11-15  3:39 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: Kenichi Handa, ding, emacs-devel

> I think it's better, too.  However, there might be code that
> copies data from nntp-server-buffer to a multibyte buffer.  I'm
> not able to check all of the Gnus code.

I understand the desire to avoid changing code, but I think in the long
run it'll pay off.

>>> (IIUC, copying data from a multibyte buffer to a unibyte buffer
>>> causes no problem).

>> I'm not sure I understand: copying data from a multibyte buffer to
>> a unibyte buffer is exactly the case that can cause problems.

> I agree that's generally true.  But in Gnus' case, data in a
> multibyte work buffer are the multibyte version of binary data.
> I don't know proper words to explain it, sorry.  In other words,
> they are the one which `string-to-multibyte' converted binary
> data to.  For example:

> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (insert (string-to-multibyte (encode-coding-string "日本語" 'utf-8)))
>   (let ((buffer (current-buffer)))
>     (with-temp-buffer
>       (set-buffer-multibyte nil)
>       (insert-buffer-substring buffer)
>       (decode-coding-string (buffer-string) 'utf-8))))
>  => "日本語"

> I'm not sure it works with any data, though.

I'm not sure what you're saying.  But IIUC the source buffer in your
example would be nntp-server-buffer, in which case turning it into
unibyte will not introduce any problem.  On the contrary, it'll make it
more obviously correct.


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-14 11:39     ` Katsumi Yamaoka
  2007-11-14 14:52       ` Stefan Monnier
@ 2007-11-15 10:20       ` Katsumi Yamaoka
  2007-11-15 11:08         ` Kenichi Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-15 10:20 UTC (permalink / raw)
  To: ding; +Cc: emacs-devel

>>>>> Katsumi Yamaoka wrote:

> BTW, I found another problem with Emacs 21 (Gnus still supports
> Emacs 21, IIUC).  So, I'll go on looking into it further.

I realized that a network process created by
`open-network-stream' in Emacs 21 breaks encoded non-ASCII group
names if the process buffer is in multibyte mode, even if the
process coding system is binary.  It behaves as if
`toggle-enable-multibyte-characters' modifies binary data when
turning on the multibyteness of a buffer.  So, I made changes in
nntp.el in the Gnus trunk so that it makes the process buffer
unibyte.  I also modified the nntp functions that copy data from
a unibyte buffer to a multibyte buffer.

Regards,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 10:20       ` Katsumi Yamaoka
@ 2007-11-15 11:08         ` Kenichi Handa
  2007-11-15 11:41           ` Katsumi Yamaoka
  2007-11-15 15:22           ` Stefan Monnier
  0 siblings, 2 replies; 23+ messages in thread
From: Kenichi Handa @ 2007-11-15 11:08 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: ding, emacs-devel

In article <b4m1war6ca5.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

>>>>>> Katsumi Yamaoka wrote:
> > BTW, I found another problem with Emacs 21 (Gnus still supports
> > Emacs 21, IIUC).  So, I'll go on looking into it further.

> I realized a network process that is created by
> `open-network-stream' in Emacs 21 breaks encoded non-ASCII group
> names if the process buffer is in the multibyte mode even if the
> process coding system is binary.  It behaves as if
> `toggle-enable-multibyte-characters' modifies binary data when
> turning on the multibyteness of a buffer.

If "modifies" means that 8-bit bytes are converted to
multibyte characters the way string-as-multibyte does, it's
expected behaviour.

I long ago proposed a facility that turns on the
multibyteness of a buffer while converting 8-bit bytes to
multibyte characters the way string-to-multibyte does, but
it was not accepted.

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 11:08         ` Kenichi Handa
@ 2007-11-15 11:41           ` Katsumi Yamaoka
  2007-11-15 14:41             ` Kenichi Handa
  2007-11-15 15:22           ` Stefan Monnier
  1 sibling, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-15 11:41 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ding, emacs-devel

>>>>> Kenichi Handa wrote:
> In article <b4m1war6ca5.fsf@jpl.org>,
>	Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> I realized a network process that is created by
>> `open-network-stream' in Emacs 21 breaks encoded non-ASCII group
>> names if the process buffer is in the multibyte mode even if the
>> process coding system is binary.  It behaves as if
>> `toggle-enable-multibyte-characters' modifies binary data when
>> turning on the multibyteness of a buffer.

(The changes that I made in nntp.el have been archived in
 <URL:http://article.gmane.org/gmane.emacs.gnus.commits/5519>.)

> If "modifies" means that 8-bit bytes are converted to
> multibyte characters as what string-as-multibyte does, it's
> an expected behaviour.

What I observed was different.  The group name "テスト" is
encoded in utf-8 by the nntp server into:

"\343\203\206\343\202\271\343\203\210"

After it is transferred to Gnus, in the nntp process buffer it is
modified into:

"\343\203XY\343\203\210"

Where X is (make-char 'greek-iso8859-7 99)
  and Y is (make-char 'latin-iso8859-2 57).

Since Gnus treats a group name as a unibyte string, finally it
is made into:

"\343\203\343\271\343\203\210"

> I long ago proposed a facility that turns on the
> multibyteness of a buffer while converting 8-bit bytes to
> multibyte characters as what string-to-multibyte does, but
> not accepted.

But modern Emacsen do so, don't they?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 11:41           ` Katsumi Yamaoka
@ 2007-11-15 14:41             ` Kenichi Handa
  2007-11-15 23:31               ` Katsumi Yamaoka
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2007-11-15 14:41 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: ding, emacs-devel

In article <b4moddv20sy.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

> > If "modifies" means that 8-bit bytes are converted to
> > multibyte characters as what string-as-multibyte does, it's
> > an expected behaviour.

> What I observed was different.  The group name "テスト" is
> encoded by utf-8 by the nntp server into:

> "\343\203\206\343\202\271\343\203\210"

> After it is transferred to Gnus, in the nntp process buffer it is
> modified into:

> "\343\203XY\343\203\210"

> Where X is (make-char 'greek-iso8859-7 99)
>   and Y is (make-char 'latin-iso8859-2 57).

That is exactly what string-as-multibyte does. \206\343 and
\202\271 are valid multibyte forms in the current Emacs,
thus are treated as multibyte characters.

> Since Gnus treats a group name as a unibyte string, finally it
> is made into:

> "\343\203\343\271\343\203\210"

It seems that Gnus treats "\343\203XY\343\203\210" as
unibyte by converting it with string-make-unibyte.

Please try this:

(string-make-unibyte
 (string-as-multibyte "\343\203\206\343\202\271\343\203\210"))

You'll get the above result, ... yes, very weird.

On the other hand,

(string-as-unibyte
 (string-as-multibyte "\343\203\206\343\202\271\343\203\210"))
 =>  "\343\203\206\343\202\271\343\203\210"

> > I long ago proposed a facility that turns on the
> > multibyteness of a buffer while converting 8-bit bytes to
> > multibyte characters as what string-to-multibyte does, but
> > not accepted.

> But the modern Emacsen does do so, doesn't it?

No.

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 11:08         ` Kenichi Handa
  2007-11-15 11:41           ` Katsumi Yamaoka
@ 2007-11-15 15:22           ` Stefan Monnier
  2007-11-16  0:29             ` Kenichi Handa
  2007-11-16 10:50             ` Eli Zaretskii
  1 sibling, 2 replies; 23+ messages in thread
From: Stefan Monnier @ 2007-11-15 15:22 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Katsumi Yamaoka, ding, emacs-devel

> If "modifies" means that 8-bit bytes are converted to
> multibyte characters as what string-as-multibyte does, it's
> an expected behaviour.

99% of the uses of string-as-multibyte are bugs.


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 14:41             ` Kenichi Handa
@ 2007-11-15 23:31               ` Katsumi Yamaoka
  2007-11-16  0:51                 ` Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-15 23:31 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ding, emacs-devel

>>>>> Kenichi Handa <handa@ni.aist.go.jp> wrote:

> In article <b4moddv20sy.fsf@jpl.org>,
>	Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> What I observed was different.

> That is exactly what string-as-multibyte does. \206\343 and
> \202\271 are valid multibyte forms in the current Emacs,
> thus are treated as multibyte characters.

Now I understand why such readable characters appeared abruptly.

[...]

> Please try this:

> (string-make-unibyte
>  (string-as-multibyte "\343\203\206\343\202\271\343\203\210"))

> You'll get the above result, ... yes, very weird.

Oh, that surprised me a bit.  But I often see such things
while playing with unibyte and multibyte things, and it always
confuses me.

> On the other hand,

> (string-as-unibyte
>  (string-as-multibyte "\343\203\206\343\202\271\343\203\210"))
>  =>  "\343\203\206\343\202\271\343\203\210"

>>> I long ago proposed a facility that turns on the
>>> multibyteness of a buffer while converting 8-bit bytes to
>>> multibyte characters as what string-to-multibyte does, but
>>> not accepted.

>> But the modern Emacsen does do so, doesn't it?

> No.

Oops.  I misunderstood the reason why Emacs 22 and 23 don't
break 8-bit data while it is being fed into a multibyte buffer
from a network process whose process coding system is
binary.  So, maybe the best way for now is still to use a
unibyte buffer for unibyte data and a multibyte buffer for
multibyte data, and to use a string, not a buffer, to encode
and decode data if the multibyteness of the data will change,
like:

(insert (prog1
	    (decode-coding-string (buffer-string) 'coding)
	  (erase-buffer)
	  (set-buffer-multibyte t)))



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 15:22           ` Stefan Monnier
@ 2007-11-16  0:29             ` Kenichi Handa
  2007-11-16 10:50             ` Eli Zaretskii
  1 sibling, 0 replies; 23+ messages in thread
From: Kenichi Handa @ 2007-11-16  0:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: yamaoka, ding, emacs-devel

In article <jwvd4ubttyx.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > If "modifies" means that 8-bit bytes are converted to
> > multibyte characters as what string-as-multibyte does, it's
> > an expected behaviour.

> 99% of the uses of string-as-multibyte are bugs.

Sure.

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 23:31               ` Katsumi Yamaoka
@ 2007-11-16  0:51                 ` Kenichi Handa
  2007-11-16  1:24                   ` Katsumi Yamaoka
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2007-11-16  0:51 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: ding, emacs-devel

In article <b4my7czxeze.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

> Oops.  I misunderstood the reason why Emacs 22 and 23 don't
> break 8-bit data while it is being fed into a multibyte buffer
> from a network process whose process coding system is binary.
> So, maybe the best approach for now is still to use a unibyte
> buffer for unibyte data and a multibyte buffer for multibyte
> data, and to use a string, not a buffer, to encode and decode
> data if the multibyteness of the data will change, like:

> (insert (prog1
> 	    (decode-coding-string (buffer-string) 'coding)
> 	  (erase-buffer)
> 	  (set-buffer-multibyte t)))

The best approach is to decide a buffer's multibyteness just after
it is created, and not to change it later.
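
A minimal sketch of that rule, assuming the raw data arrives as a
unibyte string: the temporary buffer is made unibyte immediately after
creation and stays that way, and only the extracted substring is
decoded.  The function name `my-first-token-decoded' and the sample
active-file line are hypothetical.

```elisp
;; Hypothetical helper, not part of Gnus or Emacs.
(defun my-first-token-decoded (bytes coding)
  "Return the first whitespace-delimited token of BYTES, decoded with CODING.
BYTES is a unibyte string.  The work buffer is unibyte from the
start and its multibyteness is never changed afterwards."
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; decided once, right after creation
    (insert bytes)
    (goto-char (point-min))
    (skip-chars-forward "^ \t\n")       ; stop at the end of the first token
    (decode-coding-string (buffer-substring (point-min) (point)) coding)))

;; Usage: a made-up active-file line whose group name is utf-8 encoded.
(length (my-first-token-decoded
         "local.\343\203\206\343\202\271\343\203\210 5 1 y\n" 'utf-8))
;; → 9
```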

---
Kenichi Handa
handa@ni.aist.go.jp


* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-16  0:51                 ` Kenichi Handa
@ 2007-11-16  1:24                   ` Katsumi Yamaoka
  2007-11-16  2:51                     ` Stefan Monnier
  0 siblings, 1 reply; 23+ messages in thread
From: Katsumi Yamaoka @ 2007-11-16  1:24 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: ding, emacs-devel

>>>>> Kenichi Handa wrote:
> In article <b4my7czxeze.fsf@jpl.org>,
>	Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> (insert (prog1
>> 	    (decode-coding-string (buffer-string) 'coding)
>> 	  (erase-buffer)
>> 	  (set-buffer-multibyte t)))

> The best approach is to decide a buffer's multibyteness just after
> it is created, and not to change it later.

I see.  In relation to this, I've been wanting to exterminate
the `mm-with-unibyte-current-buffer' macro that Gnus uses here
and there (if you have time, please look at how evil it is, in
mm-util.el).


* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-16  1:24                   ` Katsumi Yamaoka
@ 2007-11-16  2:51                     ` Stefan Monnier
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2007-11-16  2:51 UTC (permalink / raw)
  To: Katsumi Yamaoka; +Cc: Kenichi Handa, ding, emacs-devel

> I see.  In relation to this, I've been wanting to exterminate
> the `mm-with-unibyte-current-buffer' macro that Gnus uses here
> and there (if you have time, please look at how evil it is, in
> mm-util.el).

Yes, I spotted it a while ago already (I'm using a few local hacks to
try and catch some multi/unibyte abuses so I tend to bump into bugs
a bit earlier than in normal use).

I think a mistake in Emacs's handling of encoding issues is that we use
"unibyte" and "multibyte" rather than "bytes" and "chars".
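
To make the bytes/chars distinction concrete, here is a small sketch.
The function name `my-bytes-vs-chars-demo' is hypothetical; it only
uses `decode-coding-string', `encode-coding-string',
`string-to-multibyte' and `string-to-unibyte', and compares a byte
string, its decoded characters, and the same bytes carried unchanged
in a multibyte string.

```elisp
;; Hypothetical demo function, not part of Gnus or Emacs.
(defun my-bytes-vs-chars-demo ()
  "Contrast a string of bytes with the characters those bytes encode."
  (let* ((bytes "\343\203\206\343\202\271\343\203\210") ; nine utf-8 bytes
         (chars (decode-coding-string bytes 'utf-8))    ; three characters
         (raw   (string-to-multibyte bytes)))           ; nine raw-byte chars
    (list (multibyte-string-p bytes) (length bytes)
          (multibyte-string-p chars) (length chars)
          (multibyte-string-p raw)   (length raw)
          ;; Both conversions get back to the original bytes.
          (equal (string-to-unibyte raw) bytes)
          (equal (encode-coding-string chars 'utf-8) bytes))))

(my-bytes-vs-chars-demo)
;; → (nil 9 t 3 t 9 t t)
```

Both `chars' and `raw' are "multibyte", yet only `chars' holds decoded
characters; `raw' is still the original bytes in a different container.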


        Stefan


PS: Here are some hunks from my local changes.


@@ -1034,16 +1068,18 @@
 (defmacro mm-with-unibyte-buffer (&rest forms)
   "Create a temporary buffer, and evaluate FORMS there like `progn'.
 Use unibyte mode for this."
-  `(let (default-enable-multibyte-characters)
-     (with-temp-buffer ,@forms)))
+  `(with-temp-buffer
+     (mm-disable-multibyte)
+     ,@forms))
 (put 'mm-with-unibyte-buffer 'lisp-indent-function 0)
 (put 'mm-with-unibyte-buffer 'edebug-form-spec '(body))
 
 (defmacro mm-with-multibyte-buffer (&rest forms)
   "Create a temporary buffer, and evaluate FORMS there like `progn'.
 Use multibyte mode for this."
-  `(let ((default-enable-multibyte-characters t))
-     (with-temp-buffer ,@forms)))
+  `(with-temp-buffer
+     (mm-enable-multibyte)
+     ,@forms))
 (put 'mm-with-multibyte-buffer 'lisp-indent-function 0)
 (put 'mm-with-multibyte-buffer 'edebug-form-spec '(body))
 
@@ -1058,24 +1094,29 @@
 harmful since it is likely to modify existing data in the buffer.
 For instance, it converts \"\\300\\255\" into \"\\255\" in
 Emacs 23 (unicode)."
-  (let ((multibyte (make-symbol "multibyte"))
-	(buffer (make-symbol "buffer")))
-    `(if mm-emacs-mule
-	 (let ((,multibyte enable-multibyte-characters)
-	       (,buffer (current-buffer)))
-	   (unwind-protect
-	       (let (default-enable-multibyte-characters)
-		 (set-buffer-multibyte nil)
-		 ,@forms)
-	     (set-buffer ,buffer)
-	     (set-buffer-multibyte ,multibyte)))
-       (let (default-enable-multibyte-characters)
-	 ,@forms))))
+  (message "Braindeadly defined macro: mm-with-unibyte-current-buffer")
+  ;; (let ((multibyte (make-symbol "multibyte"))
+  ;;       (buffer (make-symbol "buffer")))
+  ;;   `(if mm-emacs-mule
+  ;;        (let ((,multibyte enable-multibyte-characters)
+  ;;              (,buffer (current-buffer)))
+  ;;          (unwind-protect
+  ;;              (let (default-enable-multibyte-characters)
+  ;;       	 (set-buffer-multibyte nil)
+  ;;       	 ,@forms)
+  ;;            (set-buffer ,buffer)
+  ;;            (set-buffer-multibyte ,multibyte)))
+  ;;      (let (default-enable-multibyte-characters)
+  ;;        ,@forms)))
+  `(progn (assert (not enable-multibyte-characters))
+          ,@forms)
+  )
 (put 'mm-with-unibyte-current-buffer 'lisp-indent-function 0)
 (put 'mm-with-unibyte-current-buffer 'edebug-form-spec '(body))
 
 (defmacro mm-with-unibyte (&rest forms)
   "Eval the FORMS with the default value of `enable-multibyte-characters' nil."
+  (message "Braindead macro: mm-with-unibyte")
   `(let (default-enable-multibyte-characters)
      ,@forms))
 (put 'mm-with-unibyte 'lisp-indent-function 0)
@@ -1083,6 +1124,7 @@
 
 (defmacro mm-with-multibyte (&rest forms)
   "Eval the FORMS with the default value of `enable-multibyte-characters' t."
+  (message "Braindead macro: mm-with-multibyte")
   `(let ((default-enable-multibyte-characters t))
      ,@forms))
 (put 'mm-with-multibyte 'lisp-indent-function 0)


* Re: [Unicode-2] `read' always returns multibyte symbol
  2007-11-15 15:22           ` Stefan Monnier
  2007-11-16  0:29             ` Kenichi Handa
@ 2007-11-16 10:50             ` Eli Zaretskii
  1 sibling, 0 replies; 23+ messages in thread
From: Eli Zaretskii @ 2007-11-16 10:50 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: yamaoka, handa, ding, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Thu, 15 Nov 2007 10:22:12 -0500
> Cc: Katsumi Yamaoka <yamaoka@jpl.org>, ding@gnus.org, emacs-devel@gnu.org
> 
> 99% of the uses of string-as-multibyte are bugs.

Should we emit a warning from the byte compiler about that?  (Sorry if
we already do: I didn't have time to look.)


end of thread, other threads:[~2007-11-16 10:50 UTC | newest]

Thread overview: 23+ messages
2007-11-13  9:41 [Unicode-2] `read' always returns multibyte symbol Katsumi Yamaoka
2007-11-13 12:55 ` Kenichi Handa
2007-11-13 15:10   ` Stefan Monnier
2007-11-14  4:53     ` Kenichi Handa
2007-11-14  3:56   ` Katsumi Yamaoka
2007-11-14 11:39     ` Katsumi Yamaoka
2007-11-14 14:52       ` Stefan Monnier
2007-11-14 23:52         ` Katsumi Yamaoka
2007-11-15  1:15           ` Stefan Monnier
2007-11-15  3:01             ` Katsumi Yamaoka
2007-11-15  3:39               ` Stefan Monnier
2007-11-15 10:20       ` Katsumi Yamaoka
2007-11-15 11:08         ` Kenichi Handa
2007-11-15 11:41           ` Katsumi Yamaoka
2007-11-15 14:41             ` Kenichi Handa
2007-11-15 23:31               ` Katsumi Yamaoka
2007-11-16  0:51                 ` Kenichi Handa
2007-11-16  1:24                   ` Katsumi Yamaoka
2007-11-16  2:51                     ` Stefan Monnier
2007-11-15 15:22           ` Stefan Monnier
2007-11-16  0:29             ` Kenichi Handa
2007-11-16 10:50             ` Eli Zaretskii
2007-11-13 15:07 ` Stefan Monnier
