In article <b4moddycwjv.fsf@jpl.org>, Katsumi Yamaoka <yamaoka@jpl.org> writes:

> The following Lisp snippet emulates what Gnus does when reading
> active data for the local.テスト newsgroup.  The buffer contains
> data which have been retrieved from the nntp server.  Note that
> the newsgroup name contains non-ASCII characters, which have been
> encoded in utf-8 by the server.

> --8<---------------cut here---------------start------------->8---
> (let ((string (encode-coding-string "local.テスト" 'utf-8)))
>   (with-temp-buffer
>     (set-buffer-multibyte t)
>     (insert (string-to-multibyte string))
>     (goto-char (point-min))
>     (multibyte-string-p (symbol-name (read (current-buffer))))))
> --8<---------------cut here---------------end--------------->8---

> While Emacs trunk returns nil for this, Emacs Unicode-2 returns t.

That is because `read' decides whether the name is unibyte or
multibyte by checking whether it is a valid multibyte sequence.
In the trunk, a utf-8 byte sequence is not a valid multibyte
sequence, but in emacs-unicode-2 it is.
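
To make the unibyte/multibyte distinction concrete, here is a small
sketch, assuming a recent Emacs.  It only compares three string
representations of the same name; it does not show what `read' itself
returns.  The function name `my-string-kinds' is hypothetical.

```elisp
;; Illustration only; `my-string-kinds' is a hypothetical helper.
(defun my-string-kinds (name)
  "Return multibyte-ness and lengths of three representations of NAME."
  (let* ((bytes (encode-coding-string name 'utf-8)) ; unibyte: raw utf-8 bytes
         (raw   (string-to-multibyte bytes))        ; multibyte, one char per byte
         (text  (decode-coding-string bytes 'utf-8))) ; multibyte, decoded chars
    (list (multibyte-string-p bytes)
          (multibyte-string-p raw)
          (multibyte-string-p text)
          (length bytes)
          (length text))))

(my-string-kinds "local.テスト")
;; → (nil t t 15 9)
```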

> If it is not intentional, I hope `read' behaves just like it does
> in Emacs trunk.

The relevant code for `read' is very complicated and I want
to avoid touching it if there's another way.

In addition, I think it is right for the above code to
return t; i.e. any symbol created by reading a multibyte
buffer should have a multibyte string name.  The bug to fix
is that the following code also returns t in
emacs-unicode-2.

< --8<---------------cut here---------------start------------->8---
< (let ((string (encode-coding-string "local.テスト" 'utf-8)))
<   (with-temp-buffer
<     (set-buffer-multibyte nil)
<     (insert string)
<     (goto-char (point-min))
<     (multibyte-string-p (symbol-name (read (current-buffer))))))
< --8<---------------cut here---------------end--------------->8---

> Otherwise, is there a way to make `read' return a unibyte
> symbol (without slowing down)?

A replacement for the above code is as simple as this:

(multibyte-string-p
 (symbol-name (intern (encode-coding-string "local.テスト" 'utf-8))))

But, hmmm, it seems that we can't use such code in Gnus...

> In the inside of Gnus, non-ASCII group names are all treated as
> unibyte strings, that are the ones that the server has encoded
> with certain coding systems.  Because of the present behavior of
> `read' in Emacs Unicode-2, Gnus doesn't work with such newsgroups
> perfectly.  You can find the actual code in gnus-start.el as
> follows:

> --8<---------------cut here---------------start------------->8---
> ;; Read an active file and place the results in `gnus-active-hashtb'.
> (defun gnus-active-to-gnus-format (&optional method hashtb ignore-errors
> 					     real-active)
> [...]
> 	      ;; group gets set to a symbol interned in the hash table
> 	      ;; (what a hack!!) - jwz
> 	      (setq group (let ((obarray hashtb)) (read cur)))
> --8<---------------cut here---------------end--------------->8---

How about this?

(setq group
       (let ((obarray hashtb) pos)
	 (skip-syntax-forward "^w_")
	 (setq pos (point))
	 (skip-syntax-forward "w_")
	 (intern (buffer-substring pos (point)))))

I think the overhead is just a few more function calls.  The
actual work (searching for a run of symbol constituents,
making a string from them, and interning it) is almost the same.
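
For trying the idea outside Gnus, here is a self-contained sketch.
The names `my-group-name-syntax-table' and `my-intern-group-at-point'
are hypothetical, and the syntax table is an assumption: the standard
syntax table gives `.' punctuation syntax, so without it the scan
would stop at the first dot of a group name.  The sketch passes the
obarray to `intern' explicitly instead of rebinding `obarray', and it
is tested only with an ASCII name.

```elisp
;; Hypothetical helper for illustration; not part of Gnus.
(defvar my-group-name-syntax-table
  (let ((table (make-syntax-table)))
    ;; Assumption: treat `.' as a symbol constituent so that a dotted
    ;; group name is scanned as a single token.
    (modify-syntax-entry ?. "_" table)
    table)
  "Syntax table used while scanning a group name.")

(defun my-intern-group-at-point (hashtb)
  "Intern the run of word/symbol characters after point into HASHTB.
Return the interned symbol and leave point after the name."
  (with-syntax-table my-group-name-syntax-table
    ;; Skip anything that cannot start a name.
    (skip-syntax-forward "^w_")
    (let ((start (point)))
      ;; Move over the name itself.
      (skip-syntax-forward "w_")
      (intern (buffer-substring start (point)) hashtb))))

;; Usage: parse the group name from one active-file style line.
(with-temp-buffer
  (insert "local.test 10 1 y\n")
  (goto-char (point-min))
  (symbol-name (my-intern-group-at-point (obarray-make))))
;; → "local.test"
```

`obarray-make' needs Emacs 25 or later; the name is taken with
`buffer-substring', so its unibyte or multibyte representation follows
that of the buffer.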

---
Kenichi Handa
handa@ni.aist.go.jp