Re: imap breaks latin-1 characters

Gnus development mailing list

* Re: imap breaks latin-1 characters
       [not found]   ` <iluvgvtn673.fsf@barbar.josefsson.org>
@ 2000-09-18 15:17     ` ShengHuo ZHU
  2000-09-18 19:10       ` Simon Josefsson
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 15:17 UTC (permalink / raw)
  Cc: ding

Simon Josefsson <simon@josefsson.org> writes:

> > Simon, the following code may illustrate the bug.
> > 
> > (let (s (coding-system-for-write 'binary))
> >   (mm-with-unibyte-buffer
> >     (insert "0123" 223 "ABCD")
> >     (setq s (buffer-string)))
> >   (with-temp-file "test"
> >   ;;  (mm-disable-multibyte)
> >     (insert s)))
> > 
> > String `s' is unibyte.  When it is inserted into the multibyte "test"
> > buffer, Emacs converts s into multibyte.  The ugly thing is that all
> > \240-\377 chars are converted into latin-iso8859-1, which is where
> > those \201's come from.
> 
> Ouch.
> 
> Perhaps (insert (string-as-multibyte (concat "0123" 223 "ABCD")))
> could be used instead, so we don't have to disable multibyte in the
> buffer. IMHO switching multibyte on and off in various parts of
> imap/nnimap/mail-source/Gnus is causing part of the headache in the
> first place. But it's probably more consistent with how things work so
> there's probably no point in changing it.

I thought so too, but I changed my mind after recently reading some of
the MULE code in Emacs.  IMHO, switching multibyte on and off
complicates things and is not the `right thing' to do.  It results in
unibyte strings being inserted into multibyte buffers, \240-\377 being
converted into latin-iso8859-1, and the text then being decoded twice.

The behavior of unibyte-char-to-multibyte may also hide some bugs.
For example, some NNTP servers may have groups with latin-iso8859-1
chars in their names.  The names are simply inserted into the group
buffer without decoding, and those Latin chars show up correctly.  It
seems to work, but not for other charsets.

Now I suggest removing all unibyte-buffer in Gnus (maybe in oGnus).
All binary data would exist as multibyte strings, though it may not
work with early Emacs 20 MULE bugs (why should we bother?).

ShengHuo


* Re: imap breaks latin-1 characters
  2000-09-18 15:17     ` imap breaks latin-1 characters ShengHuo ZHU
@ 2000-09-18 19:10       ` Simon Josefsson
  2000-09-18 19:58         ` ShengHuo ZHU
  0 siblings, 1 reply; 18+ messages in thread
From: Simon Josefsson @ 2000-09-18 19:10 UTC (permalink / raw)
  Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> Simon Josefsson <simon@josefsson.org> writes:
>
> > Perhaps (insert (string-as-multibyte (concat "0123" 223 "ABCD")))
> > could be used instead, so we don't have to disable multibyte in the
> > buffer. IMHO switching multibyte on and off in various parts of
> > imap/nnimap/mail-source/Gnus is causing part of the headache in the
> > first place. But it's probably more consistent with how things work so
> > there's probably no point in changing it.
> 
> I thought so too, but I changed my mind after recently reading some of
> the MULE code in Emacs.  IMHO, switching multibyte on and off
> complicates things and is not the `right thing' to do.

I agree fully.

> The behavior of unibyte-char-to-multibyte may also hide some bugs.
> For example, some NNTP servers may have groups with latin-iso8859-1
> chars in their names.  The names are simply inserted into the group
> buffer without decoding, and those Latin chars show up correctly.  It
> seems to work, but not for other charsets.

This can never work properly, so I'm not sure we should care.  NNTP
doesn't support character set tagging.  All we can provide is
intelligent defaults and a possibility of telling Gnus what charset
group names are in.

> Now I suggest removing all unibyte-buffer in Gnus (maybe in oGnus).

oGnus please, imho.

A silly question: how do you convert a binary raw multibyte string
into a multibyte string of a named charset?  Is there an API to do
this?  Do you need to go over the string character by character and
(make-char CHARSET c)?

I'm concerned with efficiency.  Large attachments really kill Gnus
performance today.  Reducing number of insert's, buffer-substring's,
encoding, decoding and stuff would be nice.

(Of course, attachments should ideally not be fetched at all unless
requested, but that's another story.)

> All binary data would exist as multibyte strings, though it may not
> work with early Emacs 20 MULE bugs (why should we bother?).

We shouldn't bother.


* Re: imap breaks latin-1 characters
  2000-09-18 19:10       ` Simon Josefsson
@ 2000-09-18 19:58         ` ShengHuo ZHU
  2000-09-18 20:31           ` Simon Josefsson
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 19:58 UTC (permalink / raw)

Simon Josefsson <simon@josefsson.org> writes:

> oGnus please, imho.

I think so.  I don't want to make any trouble in the beta version.

> A silly question: how do you convert a binary raw multibyte string
> into a multibyte string of a named charset?  Is there an API to do
> this?  Do you need to go over the string character by character and
> (make-char CHARSET c)?

decode-coding-string is the answer.  Is it that simple?  You are
probably thinking about those "\201" bugs.  Let's look at some examples:

(with-temp-buffer
    (insert "1234\337ABCD")
    (decode-coding-region (point-min) (point-max) 'iso-8859-1)
    (buffer-string))

(with-temp-buffer
    (insert (string-as-multibyte "1234\337ABCD"))
    (decode-coding-region (point-min) (point-max) 'iso-8859-1)
    (buffer-string))

In the first case, it returns a string containing "\201", because the
"\337" of the unibyte string "1234\337ABCD" is converted into a
latin-iso8859-1 character instead of a binary one.  It is therefore
decoded twice.

In the second case, string-as-multibyte did the trick, but it is not a
total solution, especially for those cases with \200-\237 in the
string.  For example, (string-as-multibyte "1234\201\337ABCD") returns
a string with one latin-iso8859-1 character instead of two 8-bit
characters.

Probably, the question is how to convert a binary raw unibyte string
into a binary raw multibyte string (instead of latin-iso8859-1).  I
don't know of any efficient solution.  Therefore I suggest avoiding all
unibyte buffers and non-ASCII unibyte strings.

> I'm concerned with efficiency.  Large attachments really kill Gnus
> performance today.  Reducing number of insert's, buffer-substring's,
> encoding, decoding and stuff would be nice.

Right, too many temporary buffers are involved in mm-get-part and
mm-insert-part.

> (Of course, attachments should ideally not be fetched at all unless
> requested, but that's another story.)

It sounds like message/external.  Would adding a fetch callback
function into the handle structure help?

ShengHuo


* Re: imap breaks latin-1 characters
  2000-09-18 19:58         ` ShengHuo ZHU
@ 2000-09-18 20:31           ` Simon Josefsson
  2000-09-18 22:34             ` ShengHuo ZHU
  2000-09-18 22:42             ` Kai Großjohann
  0 siblings, 2 replies; 18+ messages in thread
From: Simon Josefsson @ 2000-09-18 20:31 UTC (permalink / raw)
  Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> decode-coding-string is the answer.

Fine, thanks.

> In the first case, it returns a string containing "\201", because the
> "\337" of the unibyte string "1234\337ABCD" is converted into a
> latin-iso8859-1 character instead of a binary one.  It is therefore
> decoded twice.
> 
> In the second case, string-as-multibyte did the trick, but it is not a
> total solution, especially for those cases with \200-\237 in the
> string.  For example, (string-as-multibyte "1234\201\337ABCD") returns
> a string with one latin-iso8859-1 character instead of two 8-bit
> characters.

Ok, I see the problems.  Will this ever be fixed in Emacs 20.x, or is
the answer to wait for Emacs 21 here?  Replacing unibyte buffers with
string-as-multibyte's is of no use if string-as-multibyte is buggy.

More problems: Press C-x C-e in the article buffer to evaluate your
examples.  I get two \201's in the echo area from the first example
and one \201 from the second.  Why?!  The echo area _is_ a multibyte
buffer, isn't it?

I've seen \201's in the echo area before (BBDB) but never elsewhere,
this might be the same issue.  I've no idea how to debug this.

> Probably, the question is how to convert a binary raw unibyte string
> into a binary raw multibyte string (instead of latin-iso8859-1).  I
> don't know of any efficient solution.  Therefore I suggest avoiding all
> unibyte buffers and non-ASCII unibyte strings.

Yes, I agree with this.

> > (Of course, attachments should ideally not be fetched at all unless
> > requested, but that's another story.)
> 
> It sounds like message/external.  Would adding a fetch callback
> function into the handle structure help?

I think a new backend interface is needed here.  The mm-* functions
could be modified to work with a MIME structure instead of the raw
mail, and whatever body parts are displayed would invoke a fetch
callback that fetches that part.

More oGnus stuff.
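
To make the fetch-callback idea a bit more concrete, here is a rough
sketch of a handle that only fetches its body when it is first asked
for.  It is an illustration under assumptions, not a proposal for the
actual mm-* interface: the names `sketch-make-lazy-part' and
`sketch-part-body' and the plist layout are all hypothetical.

```elisp
;; Sketch only: none of these names exist in Gnus; they are hypothetical.
(defun sketch-make-lazy-part (type fetch-function)
  "Return a handle for a MIME part of TYPE whose body is not fetched yet.
FETCH-FUNCTION is called with no arguments and must return the body."
  (list :type type :fetch fetch-function :body nil))

(defun sketch-part-body (part)
  "Return the body of PART, fetching it on first use and caching it."
  (or (plist-get part :body)
      (let ((body (funcall (plist-get part :fetch))))
        ;; :body is already present in the plist, so this updates
        ;; the existing handle in place.
        (plist-put part :body body)
        body)))

;; Usage: nothing is fetched until the part is actually displayed.
(let ((part (sketch-make-lazy-part
             "image/png"
             (lambda () "...bytes fetched from the server..."))))
  (sketch-part-body part))
```

The backend would supply the fetch function, so the display code never
needs to know whether the part came from IMAP, NNTP or a local file.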


* Re: imap breaks latin-1 characters
  2000-09-18 20:31           ` Simon Josefsson
@ 2000-09-18 22:34             ` ShengHuo ZHU
  2000-09-19  9:27               ` Simon Josefsson
  2000-09-21 23:10               ` Dave Love
  2000-09-18 22:42             ` Kai Großjohann
  1 sibling, 2 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-18 22:34 UTC (permalink / raw)


Simon Josefsson <simon@josefsson.org> writes:

> > In the second case, string-as-multibyte did the trick, but it is not a
> > total solution, especially for those cases with \200-\237 in the
> > string.  For example, (string-as-multibyte "1234\201\337ABCD") returns
> > a string with one latin-iso8859-1 character instead of two 8-bit
> > characters.
> 
> Ok, I see the problems.  Will this ever be fixed in Emacs 20.x, or is
> the answer to wait for Emacs 21 here?  Replacing unibyte buffers with
> string-as-multibyte's is of no use if string-as-multibyte is buggy.

It is probably a feature instead of a bug. But the document "contains
an individual 8-bit byte (i.e. not part of multibyte form)" is
confusing.

> More problems: Press C-x C-e in the article buffer to evaluate your
> examples.  I get two \201's in the echo area from the first example
> and one \201 from the second.  Why?!  The echo area _is_ a multibyte
> buffer, isn't it?

> I've seen \201's in the echo area before (BBDB) but never elsewhere,
> this might be the same issue.  I've no idea how to debug this.

Wait a second.  I see `1234\201ßABCD' from the first case and
`1234ßABCD' from the second in both Emacs 20.7 and 21.0.90.  What
EXACTLY did you see?

\201 can show up in a multibyte buffer.  For example, (insert 129).

> I think a new backend interface is needed here.  The mm-* functions
> could be modified to work with a MIME structure instead of the raw
> mail, and whatever body parts are displayed would invoke a fetch
> callback that fetches that part.

Sounds interesting.

-- 
ShengHuo




* Re: imap breaks latin-1 characters
  2000-09-18 22:34             ` ShengHuo ZHU
@ 2000-09-19  9:27               ` Simon Josefsson
  2000-09-19 12:43                 ` ShengHuo ZHU
  2000-09-21 23:10               ` Dave Love
  1 sibling, 1 reply; 18+ messages in thread
From: Simon Josefsson @ 2000-09-19  9:27 UTC (permalink / raw)
  Cc: ding

ShengHuo ZHU <zsh@cs.rochester.edu> writes:

> > > In the second case, string-as-multibyte did the trick, but it is not a
> > > total solution, especially for those cases with \200-\237 in the
> > > string.  For example, (string-as-multibyte "1234\201\337ABCD") returns
> > > a string with one latin-iso8859-1 character instead of two 8-bit
> > > characters.
> > 
> > Ok, I see the problems.  Will this ever be fixed in Emacs 20.x, or is
> > the answer to wait for Emacs 21 here?  Replacing unibyte buffers with
> > string-as-multibyte's is of no use if string-as-multibyte is buggy.
> 
> It is probably a feature instead of a bug. But the document "contains
> an individual 8-bit byte (i.e. not part of multibyte form)" is
> confusing.

But won't this cause problems for us?  If we replace unibyte buffers
with string-as-multibyte where necessary (which I agree with), won't
we mutilate mail that contains \200-\237 characters?

Since both (string-as-multibyte "1234\337ABCD") and
(string-as-multibyte "1234\201\337ABCD") look the same, I think we're
in trouble.

Or is there a (string-as-multibyte-foo "1234\201\337ABCD") that returns
a multibyte string that will display as 1234\201ßABCD?  Isn't that
what we need?

> > More problems: Press C-x C-e in the article buffer to evaluate your
> > examples.  I get two \201's in the echo area from the first example
> > and one \201 from the second.  Why?!  The echo area _is_ a multibyte
> > buffer, isn't it?
> 
> > I've seen \201's in the echo area before (BBDB) but never elsewhere,
> > this might be the same issue.  I've no idea how to debug this.
> 
> Wait a second.  I see `1234\201ßABCD' from the first case and
> `1234ßABCD' from the second in both Emacs 20.7 and 21.0.90.  What
> EXACTLY did you see?

Now I don't get that behaviour, but I'm sure I did last time.
Interesting.  Ok, now I get it again.  All I did was to do something
else for a while.  I'll try to narrow it down to specific commands.  I
first thought it was the backlog, but it doesn't seem to be.

First case: `1234\201\201ßABCD' Second case: `1234\201ßABCD'. In echo
area.  A freshly started emacs/gnus displays one less \201 in both
cases.  In the *scratch* buffer I always seem to get `1234\201ßABCD'
and `1234ßABCD' respectively.

Emacs 20.7.


* Re: imap breaks latin-1 characters
  2000-09-19  9:27               ` Simon Josefsson
@ 2000-09-19 12:43                 ` ShengHuo ZHU
  2000-09-21 23:13                   ` Dave Love
  0 siblings, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-19 12:43 UTC (permalink / raw)


Simon Josefsson <simon@josefsson.org> writes:

[...]

> Or is there a (string-as-multibyte-foo "1234\201\337ABCD") that returns
> a multibyte string that will display as 1234\201ßABCD?  Isn't that
> what we need?

string-as-multibyte acts like a type cast rather than a real
conversion.

We need something that converts it to the 8bit multibyte string
"1234\201\337ABCD", not "1234\201ßABCD", which contains
latin-iso8859-1.
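
For what it's worth, later Emacs releases have, as far as I know, a
`string-to-multibyte' function that performs this kind of conversion.
It is not available in the Emacs 20.7 / 21.0.90 under discussion, so
the following is only a sketch that assumes such an Emacs; the wrapper
name `sketch-bytes-to-multibyte' is made up for illustration.

```elisp
;; Sketch only: `string-to-multibyte' is assumed to be available
;; (it is not in the Emacs 20.7 / 21.0.90 discussed in this thread).
;; `sketch-bytes-to-multibyte' is a hypothetical name, not a Gnus function.
(defun sketch-bytes-to-multibyte (bytes)
  "Return a multibyte copy of the unibyte string BYTES.
Each 8-bit byte is kept as a raw-byte character instead of being
reinterpreted as a latin-iso8859-1 character."
  (string-to-multibyte bytes))

;; Usage: check that the result is multibyte and how many characters
;; it has; both 8-bit bytes should survive as separate characters.
(let ((m (sketch-bytes-to-multibyte "1234\201\337ABCD")))
  (list (multibyte-string-p m) (length m)))
```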

[...]

> First case: `1234\201\201ßABCD' Second case: `1234\201ßABCD'. In echo
> area.  A freshly started emacs/gnus display one less \201 in both
> cases.  In the *scratch* buffer I always seem to get `1234\201ßABCD'
> and `1234ßABCD' respectively.

There could be a bug somewhere.

ShengHuo




* Re: imap breaks latin-1 characters
  2000-09-19 12:43                 ` ShengHuo ZHU
@ 2000-09-21 23:13                   ` Dave Love
  2000-09-22  0:18                     ` ShengHuo ZHU
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Love @ 2000-09-21 23:13 UTC (permalink / raw)


>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> We need something that converts it to the 8bit multibyte string
 ZSH> "1234\201\337ABCD", not "1234\201ßABCD", which contains
 ZSH> latin-iso8859-1.

I don't understand this.  For raw bytes, you're talking unibyte
(raw-text/binary) and I'd have thought that was what you'd want to
convert _from_.

I'm quite confused by this thread.  I'm not sure I ever really
followed what happens from the server through to display, but it did
seem convoluted when I looked.  I don't have time to go through it now
either.

We should be able to get Handa to advise, though, especially with Gnus
5.8 in the Emacs sources now.  I guess he will be interested in things
that stress the Mule features.

Can you sketch what happens in Gnus, what the problems are exactly and
what features you think are needed to avoid them?  I think it's too
late for new features in Mule 5.0, though.




* Re: imap breaks latin-1 characters
  2000-09-21 23:13                   ` Dave Love
@ 2000-09-22  0:18                     ` ShengHuo ZHU
  2000-09-22 14:56                       ` Dave Love
  2000-09-26  2:39                       ` Kenichi Handa
  0 siblings, 2 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22  0:18 UTC (permalink / raw)
  Cc: Kenichi Handa

Dave Love <d.love@dl.ac.uk> writes:

> >>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:
> 
>  ZSH> We need something that converts it to the 8bit multibyte string
>  ZSH> "1234\201\337ABCD", not "1234\201ßABCD", which contains
>  ZSH> latin-iso8859-1.
> 
> I don't understand this.  For raw bytes, you're talking unibyte
> (raw-text/binary) and I'd have thought that was what you'd want to
> convert _from_.
>
> I'm quite confused by this thread.  I'm not sure I ever really
> followed what happens from the server through to display, but it did
> seem convoluted when I looked.  I don't have time to go through it now
> either.
> 
> We should be able to get Handa to advise, though, especially with Gnus
> 5.8 in the Emacs sources now.  I guess he will be interested in things
> that stress the Mule features.
> 
> Can you sketch what happens in Gnus, what the problems are exactly and
> what features you think are needed to avoid them?  I think it's too
> late for new features in Mule 5.0, though.

The problems discussed concern the handling of unibyte strings and
buffers.  Unibyte buffers were introduced in Gnus partly because early
Emacs 20 could not handle 8bit data properly.  Anyway, unibyte buffers
and strings are used in Gnus 5.8 and are unlikely to be removed before
Gnus 5.9 is released.

I found that most of these problems are related to
unibyte-char-to-multibyte or the like.  For example,

  (char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,

which means 8bit unibyte characters (\240-\377) are converted to
latin-iso8859-1 characters instead of eight-bit-graphic ones (see
DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).  I guess this
setting exists for compatibility.

Now, suppose we insert an encoded (unibyte) string (maybe from some
unibyte buffer) into a multibyte buffer and then decode it.  The string
is garbled once it is inserted into the buffer.  For example, you may
get different results from these two examples (with Mule-UCS), even in
the current Emacs 21.0.90.

(decode-coding-string "\346\226\207" 'utf-8)

(with-temp-buffer
    (insert "\346\226\207")
    (decode-coding-region (point-min) (point-max) 'utf-8)
    (buffer-string))

Another pair of examples, which results in a "\201".

(decode-coding-string "\337" 'iso-8859-1)

(with-temp-buffer
    (insert "\337")
    (decode-coding-region (point-min) (point-max) 'iso-8859-1)
    (buffer-string))

Or 

(decode-coding-string "\244\244" 'big5)

(with-temp-buffer
    (insert "\244\244")
    (decode-coding-region (point-min) (point-max) 'big5)
    (buffer-string))

ShengHuo


* Re: imap breaks latin-1 characters
  2000-09-22  0:18                     ` ShengHuo ZHU
@ 2000-09-22 14:56                       ` Dave Love
  2000-09-22 16:34                         ` Kai Großjohann
  2000-09-22 17:44                         ` ShengHuo ZHU
  2000-09-26  2:39                       ` Kenichi Handa
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Love @ 2000-09-22 14:56 UTC (permalink / raw)


>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> The problems discussed concern the handling of unibyte strings
 ZSH> and buffers.  Unibyte buffers were introduced in Gnus partly
 ZSH> because early Emacs 20 could not handle 8bit data properly.

So this is just a problem with the Gnus code?

 ZSH> which means 8bit unibyte characters (\240-\377) are converted to
 ZSH> latin-iso8859-1 characters instead of eight-bit-graphic ones
 ZSH> (see DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).

Presumably you can bind `nonascii-translation-table' if necessary.  It
seems reasonable to be able to use a zero `nonascii-insert-offset'
now.  Feel free to suggest it if that will solve a problem.

 ZSH> Now, suppose we insert an encoded (unibyte) string (maybe from
 ZSH> some unibyte buffer) into a multibyte buffer and then decode it.

Decoding in a unibyte buffer seems a better idea if it needs to be
done explicitly.




* Re: imap breaks latin-1 characters
  2000-09-22 14:56                       ` Dave Love
@ 2000-09-22 16:34                         ` Kai Großjohann
  2000-09-22 17:55                           ` ShengHuo ZHU
  2000-09-25 11:23                           ` Dave Love
  2000-09-22 17:44                         ` ShengHuo ZHU
  1 sibling, 2 replies; 18+ messages in thread
From: Kai Großjohann @ 2000-09-22 16:34 UTC (permalink / raw)
  Cc: ding

On Fri, 22 Sep 2000, Dave Love wrote:

>>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:
> 
>  ZSH> Now, suppose we insert an encoded (unibyte) string (maybe from
>  ZSH> some unibyte buffer) into a multibyte buffer and then decode it.
> 
> Decoding in a unibyte buffer seems a better idea if it needs to be
> done explicitly.

What if the result of decoding is multibyte?  Does that automagically
convert the unibyte buffer to multibyte?

IIUC, the IMAP server sends a number of arbitrary bytes.  Gnus needs
to (temporarily) store them without any modification.  Then Gnus finds
out what is the encoding of those bytes and decodes the bytes
according to the encoding.  (Can it happen that the IMAP server now
sends a byte sequence which is the result of the foo encoding, and in
a few minutes it will send a byte sequence which is the result of a
different encoding?  I suppose that this can happen, but I'm not
sure.  For example, the IMAP folder names could be iso-8859-1 encoded
but the message content is ShiftJIS or whatever.)

Also IIUC, a multibyte buffer cannot hold any arbitrary byte sequence
without modification, so the buffer that receives the bytes from the
IMAP server is a unibyte buffer.  So, how do we get from the byte
sequence to the decoded character sequence?

Always eager to learn more about Mule,
kai
-- 
I like BOTH kinds of music.


* Re: imap breaks latin-1 characters
  2000-09-22 16:34                         ` Kai Großjohann
@ 2000-09-22 17:55                           ` ShengHuo ZHU
  2000-09-25 11:23                           ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22 17:55 UTC (permalink / raw)


Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> Also IIUC, a multibyte buffer cannot hold any arbitrary byte
> sequence without modification, so the buffer that receives the bytes
> from the IMAP server is a unibyte buffer.  So, how do we get from
> the byte sequence to the decoded character sequence?

A multibyte buffer in Emacs 21.0.90 (Mule 5.0) can hold them, but
Emacs 20.7 or earlier can't.  Try (insert ?\201 ?\337).

Therefore, we cannot expect Gnus to do this without unibyte buffers
until Emacs 21 is prevalent.

ShengHuo




* Re: imap breaks latin-1 characters
  2000-09-22 16:34                         ` Kai Großjohann
  2000-09-22 17:55                           ` ShengHuo ZHU
@ 2000-09-25 11:23                           ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-25 11:23 UTC (permalink / raw)


>>>>> "KG" == Kai Großjohann <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:

 KG> What if the result of decoding is multibyte?

??

(let (default-enable-multibyte-characters)
  (multibyte-string-p (decode-coding-string "a\255" 'latin-1)))
  => t

 KG> Also IIUC, a multibyte buffer cannot hold any arbitrary byte
 KG> sequence without modification,

That's clearly not true.  The issue is how you interpret the bytes as
characters.

 KG> so the buffer that receives the bytes from the IMAP server is a
 KG> unibyte buffer.  So, how do we get from the byte sequence to the
 KG> decoded character sequence?

With `decode-coding-region'.  If you want to decode it with a coding
system which encodes all octets, that should work OK even in Emacs 20
(e.g. codepage.el).

 KG> Always eager to learn more about Mule,

If you find the doc (plus examples in the Emacs source of what you
want to do) deficient, _please_ report it.  Mule wizards presumably
hang out on the English Mule list, but I'm not one of them and I can't
spend time on things which aren't going to improve the release, sorry.




* Re: imap breaks latin-1 characters
  2000-09-22 14:56                       ` Dave Love
  2000-09-22 16:34                         ` Kai Großjohann
@ 2000-09-22 17:44                         ` ShengHuo ZHU
  2000-09-25 11:25                           ` Dave Love
  1 sibling, 1 reply; 18+ messages in thread
From: ShengHuo ZHU @ 2000-09-22 17:44 UTC (permalink / raw)


Dave Love <d.love@dl.ac.uk> writes:

>  ZSH> which means 8bit unibyte characters (\240-\377) are converted to
>  ZSH> latin-iso8859-1 characters instead of eight-bit-graphic ones
>  ZSH> (see DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).
> 
> Presumably you can bind `nonascii-translation-table' if necessary.  It
> seems reasonable to be able to use a zero `nonascii-insert-offset'
> now.  Feel free to suggest it if that will solve a problem.

Without modifying unibyte_char_to_multibyte, binding
`nonascii-translation-table' would not work for this.  I think that
enabling a zero `nonascii-insert-offset' is a reasonable solution.
Anyway, to get this done, unibyte_char_to_multibyte has to be
changed.

>  ZSH> Now, suppose we insert an encoded (unibyte) string (maybe from
>  ZSH> some unibyte buffer) into a multibyte buffer and then decode it.
> 
> Decoding in a unibyte buffer seems a better idea if it needs to be
> done explicitly.

Multibyte strings (the result) cannot exist in a unibyte buffer unless
the buffer is forced to be multibyte after decoding.  For example:

    (with-temp-buffer
        (set-buffer-multibyte nil)
        (insert "\337")
        (decode-coding-region (point-min) (point-max) 'iso-8859-1)
        (set-buffer-multibyte t)
        (buffer-string))

Is this your idea?

ShengHuo




* Re: imap breaks latin-1 characters
  2000-09-22 17:44                         ` ShengHuo ZHU
@ 2000-09-25 11:25                           ` Dave Love
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-25 11:25 UTC (permalink / raw)


>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> Without modifying unibyte_char_to_multibyte, binding
 ZSH> `nonascii-translation-table' would not work for this.

In what way does it not work?

[...]

 ZSH> Is this your idea?

Yes, you should normally frob the buffer's multibyteness for coding
conversion.  I think there are examples in the Emacs source, e.g.
tar-mode IIRC; probably also Rmail.




* Re: imap breaks latin-1 characters
  2000-09-22  0:18                     ` ShengHuo ZHU
  2000-09-22 14:56                       ` Dave Love
@ 2000-09-26  2:39                       ` Kenichi Handa
  1 sibling, 0 replies; 18+ messages in thread
From: Kenichi Handa @ 2000-09-26  2:39 UTC (permalink / raw)
  Cc: ding, d.love, handa

ShengHuo ZHU <zsh@cs.rochester.edu> writes:
> Dave Love <d.love@dl.ac.uk> writes:
>>  Can you sketch what happens in Gnus, what the problems are exactly and
>>  what features you think are needed to avoid them?  I think it's too
>>  late for new features in Mule 5.0, though.

> The problems discussed concern the handling of unibyte strings and
> buffers.  Unibyte buffers were introduced in Gnus partly because early
> Emacs 20 could not handle 8bit data properly.  Anyway, unibyte buffers
> and strings are used in Gnus 5.8 and are unlikely to be removed before
> Gnus 5.9 is released.

> I found that most of these problems are related to
> unibyte-char-to-multibyte or the like.  For example,

>   (char-charset (unibyte-char-to-multibyte ?\337)) => latin-iso8859-1,

> which means 8bit unibyte characters (\240-\377) are converted to
> latin-iso8859-1 characters instead of eight-bit-graphic ones (see
> DEFAULT_NONASCII_INSERT_OFFSET in the Emacs source).  I guess this
> setting exists for compatibility.

> Now, suppose we insert an encoded (unibyte) string (maybe from some
> unibyte buffer) into a multibyte buffer and then decode it.  The string
> is garbled once it is inserted into the buffer.  For example, you may
> get different results from these two examples (with Mule-UCS), even in
> the current Emacs 21.0.90.

> (decode-coding-string "\346\226\207" 'utf-8)

> (with-temp-buffer
>     (insert "\346\226\207")
>     (decode-coding-region (point-min) (point-max) 'utf-8)
>     (buffer-string))

> Another pair of examples, which results in a "\201".

> (decode-coding-string "\337" 'iso-8859-1)

> (with-temp-buffer
>     (insert "\337")
>     (decode-coding-region (point-min) (point-max) 'iso-8859-1)
>     (buffer-string))

> Or 

> (decode-coding-string "\244\244" 'big5)

> (with-temp-buffer
>     (insert "\244\244")
>     (decode-coding-region (point-min) (point-max) 'big5)
>     (buffer-string))

I agree that Emacs Lisp programmers face an annoying problem
in such a case.  The main reason, I think, is that we cannot
mix multibyte and unibyte regions in a single buffer.  Thus,
although decode-coding-string converts a unibyte string to a
multibyte string and encode-coding-string converts a
multibyte string to a unibyte string,
decode/encode-coding-region doesn't change the multibyteness
of the region.  Programmers have to pay attention to
multibyteness explicitly.  In your example, we must write the
code below to get the same result as decode-coding-string.

(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\244\244")
  (decode-coding-region (point-min) (point-max) 'big5)
  (set-buffer-multibyte t)
  (buffer-string))

The above simulates what decode-coding-string does.  Another
way is to use string-as-multibyte as below:

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (string-as-multibyte "\244\244"))
  (decode-coding-region (point-min) (point-max) 'big5)
  (buffer-string))
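
A third variant can be sketched under the assumption that the bytes
are still available as a unibyte string: decode them with
decode-coding-string first and insert only the decoded result, so the
multibyteness of the buffer never comes into play.  The helper name
`sketch-insert-decoded' is made up for illustration and is not part of
Gnus or Emacs.

```elisp
;; Sketch only; `sketch-insert-decoded' is a hypothetical helper.
(defun sketch-insert-decoded (bytes coding-system)
  "Decode the unibyte string BYTES with CODING-SYSTEM and insert the result.
Decoding happens on the string, before anything touches the buffer, so
no raw 8-bit byte is ever converted by the insertion itself."
  (insert (decode-coding-string bytes coding-system)))

;; Usage, with the iso-8859-1 example from earlier in the thread:
(with-temp-buffer
  (sketch-insert-decoded "1234\337ABCD" 'iso-8859-1)
  (buffer-string))
```

This only helps where the data has not already been inserted into a
multibyte buffer; it does not repair text that was garbled on insertion.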

---
Ken'ichi HANDA
handa@etl.go.jp




* Re: imap breaks latin-1 characters
  2000-09-18 22:34             ` ShengHuo ZHU
  2000-09-19  9:27               ` Simon Josefsson
@ 2000-09-21 23:10               ` Dave Love
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Love @ 2000-09-21 23:10 UTC (permalink / raw)

>>>>> "ZSH" == ShengHuo ZHU <zsh@cs.rochester.edu> writes:

 ZSH> It is probably a feature instead of a bug. 

The behaviour of string-as-multibyte is certainly not a bug, just what
it does.

 ZSH> But the document "contains an individual 8-bit byte (i.e. not
 ZSH> part of multibyte form)" is confusing.

Do you mean part of the manual is confusing?  If so, please make a bug
report, though it sounds like text that won't be in the current
manual.  It's clear the Mule documentation could be better, but it
really needs feedback.

 >> I've seen \201's in the echo area before (BBDB) but never elsewhere,
 >> this might be the same issue.  I've no idea how to debug this.

BBDB doesn't consider multilingual text.  In file-coding-system-alist
I have:

 ("\\.bbdb\\'" . emacs-mule)

That definitely helps with multilingual entries, but may well not be
sufficient.


* Re: imap breaks latin-1 characters
  2000-09-18 20:31           ` Simon Josefsson
  2000-09-18 22:34             ` ShengHuo ZHU
@ 2000-09-18 22:42             ` Kai Großjohann
  1 sibling, 0 replies; 18+ messages in thread
From: Kai Großjohann @ 2000-09-18 22:42 UTC (permalink / raw)


Hm.  I didn't fully grok the discussion, but wasn't the starting point
to parse some output from the IMAP server?  How can we tell what
encoding is used by the IMAP server?  Is the encoding used by the IMAP
server always the same, or is it possible that it prints group names
in Latin-2 (say) but message contents in Latin-1 (say)?

If we can be sure that the output from the IMAP server is always in
the same coding system, then the most natural approach would be to
provide a variable for this, defaulting to process-coding-system,
perhaps, and to use this coding system to decode output from the IMAP
server, no?

But probably, I'm missing the whole point, since I didn't really
understand what you were talking about...

kai
-- 
I like BOTH kinds of music.


