locale ISO-8859-15, UTF-8 and mail...

Gnus development mailing list

* locale ISO-8859-15, UTF-8 and mail...
@ 2002-07-17 18:01 Fabien Niñoles
  2002-07-18 11:41 ` Kai Großjohann
  0 siblings, 1 reply; 6+ messages in thread
From: Fabien Niñoles @ 2002-07-17 18:01 UTC


My problem arises when I try to read mail encoded in ISO-8859-1.  Gnus
doesn't seem to recognize it correctly and isn't able to transcode
it to my display setting (ISO-8859-15, or latin-9 if you
prefer).  Even "W M c" doesn't wash the mail correctly: the characters
(previously printed as ?) are correctly replaced but have a \201
before each occurrence (whatever the encoded character is).

I'm using Emacs 21.2.1 with Mule support, on a Debian GNU/Linux
console with the locale charmap set to ISO-8859-15.
My coding setup is simply:

(prefer-coding-system 'utf-8)
;; The next form was added recently but doesn't
;; seem to fix the problem.
(setq mm-coding-system-priorities
      '(iso-latin-1))

and the mode line usually shows "--0u:" correctly, AFAIK.
Reading and writing files also work correctly.

Sorry, can't find what's wrong with that.
Thanks for your help,
Fabien
-- 
Fabien Niñoles                                               
fabien@tzone.org                         http://www.tzone.org
GPG KeyID: C15D FE9E BB35 F596 127F  BF7D 8F1F DFC9 BCE0 9436




* Re: locale ISO-8859-15, UTF-8 and mail...
  2002-07-17 18:01 locale ISO-8859-15, UTF-8 and mail Fabien Niñoles
@ 2002-07-18 11:41 ` Kai Großjohann
  2002-07-18 16:12   ` Fabien Niñoles
       [not found]   ` <878z49dnk3.fsf@tzone.org>
  0 siblings, 2 replies; 6+ messages in thread
From: Kai Großjohann @ 2002-07-18 11:41 UTC
  Cc: ding

fabien@tzone.org (Fabien Niñoles) writes:

> My problem arises when I try to read mail encoded in ISO-8859-1.  Gnus
> doesn't seem to recognize it correctly and isn't able to transcode
> it to my display setting (ISO-8859-15, or latin-9 if you
> prefer).  Even "W M c" doesn't wash the mail correctly: the characters
> (previously printed as ?) are correctly replaced but have a \201
> before each occurrence (whatever the encoded character is).

There is a file latin1-disp.el which can sort of do what you want,
but it uses Latin-1 characters instead of Latin-9 characters.  So
you'd have to make a new version which is very similar but uses
Latin-9 characters.  Then you can tell Emacs to display many
characters using their Latin-9 equivalents.
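
The mechanism behind this is a display table: each entry maps one
character to the glyphs shown in its place.  A minimal sketch of that
mechanism, assuming a Unicode-based Emacs (23 or later) rather than
the Emacs 21 discussed here; `my-latin9-fallbacks' and
`my-make-fallback-display-table' are hypothetical names, not part of
latin1-disp.el, and the ASCII fallbacks are only illustrative:

```elisp
;; Sketch only: these names are hypothetical, not from latin1-disp.el.
(defvar my-latin9-fallbacks
  '((#x20AC . "EUR")   ; EURO SIGN
    (#x0152 . "OE")    ; LATIN CAPITAL LIGATURE OE
    (#x0153 . "oe"))   ; LATIN SMALL LIGATURE OE
  "Alist mapping a character code to the text displayed in its place.")

(defun my-make-fallback-display-table (fallbacks)
  "Return a new display table showing each car of FALLBACKS as its cdr."
  (let ((table (make-display-table)))
    (dolist (entry fallbacks table)
      ;; A display-table entry is a vector of glyphs; plain characters
      ;; are valid glyphs, so the string is converted to a vector.
      (aset table (car entry) (vconcat (cdr entry))))))
```

Assigning the result to standard-display-table would apply the
fallbacks in every buffer that has no display table of its own.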

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)




* Re: locale ISO-8859-15, UTF-8 and mail...
  2002-07-18 11:41 ` Kai Großjohann
@ 2002-07-18 16:12   ` Fabien Niñoles
       [not found]   ` <878z49dnk3.fsf@tzone.org>
  1 sibling, 0 replies; 6+ messages in thread
From: Fabien Niñoles @ 2002-07-18 16:12 UTC
  Cc: ding

>>>>> "Kai" == Kai Großjohann <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:

    Kai> fabien@tzone.org (Fabien Niñoles) writes:
    >> My problem arises when I try to read mail encoded in ISO-8859-1.
    >> Gnus doesn't seem to recognize it correctly and isn't able
    >> to transcode it to my display setting (ISO-8859-15, or
    >> latin-9 if you prefer).  Even "W M c" doesn't wash the mail
    >> correctly: the characters (previously printed as ?) are
    >> correctly replaced but have a \201 before each occurrence
    >> (whatever the encoded character is).

    Kai> There is a file latin1-disp.el which can sort of do what you
    Kai> want, but it uses Latin-1 characters instead of Latin-9
    Kai> characters.  So you'd have to make a new version which is
    Kai> very similar but uses Latin-9 characters.  Then you can tell
    Kai> Emacs to display many characters using their Latin-9
    Kai> equivalents.

Done; I sent the file to Dave Love (the original maintainer of
latin1-disp.el).  But while it works for almost everything (like this
email: ¾), it doesn't work when I visit Latin-1 encoded email, or even
in mail headers.

Something like "élève" comes out as "?l?ve", and if I use "C-u W M c
latin-9", it becomes "\201él\201ève".  Plain "W M c" only adds \201
before each `?'.  I'm thinking of adding a command that simply calls
(gnus-article-decode-charset nil 'latin-9) and removes all the \201
characters, but frankly I would prefer to make it work correctly.

hmmm...

I tried to isolate the bug a little further, and it seems to be somewhere
in decode-coding-region and/or decode-coding-string (both built-in).
At least, this

(mm-decode-string "élève" "iso-8859-1") 

replaces each accented character with the sequence "\216?", and

(mm-decode-string "élève" "iso-8859-15")

simply adds \216 in front of each accented character.

So it seems to me that the only bug is the addition of the \216 (or \201
for decode-coding-region) after the translation of a character...  BTW,
I'm on powerpc; maybe that's relevant (unsigned char bugs and the like...).

Thanks,
Fabien 

-- 
Fabien Niñoles                              Debian Maintainer
fabien@debian.org                       http://www.debian.org
GPG KeyID: C15D FE9E BB35 F596 127F  BF7D 8F1F DFC9 BCE0 9436


* Re: locale ISO-8859-15, UTF-8 and mail...
       [not found]     ` <vaf4rew25r4.fsf@INBOX.auto.gnus.tok.lucy.cs.uni-dortmund.de>
@ 2002-07-18 20:44       ` Fabien Niñoles
  2002-07-18 20:56         ` Jorgen Schaefer
  2002-07-19  7:19         ` Kai Großjohann
  0 siblings, 2 replies; 6+ messages in thread
From: Fabien Niñoles @ 2002-07-18 20:44 UTC
  Cc: ding

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>>> "Kai" == Kai Großjohann <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:

    Kai> fabien@tzone.org (Fabien Niñoles) writes:
    >> Done; I sent the file to Dave Love (the original maintainer of
    >> latin1-disp.el).  But while it works for almost everything (like
    >> this email: ¾), it doesn't work when I visit Latin-1 encoded
    >> email, or even in mail headers.
    >> 
    >> Something like "élève" comes out as "?l?ve", and if I use "C-u
    >> W M c latin-9", it becomes "\201él\201ève".  Plain "W M c"
    >> only adds \201 before each `?'.  I'm thinking of adding a
    >> command that simply calls (gnus-article-decode-charset nil
    >> 'latin-9) and removes all the \201 characters, but frankly I
    >> would prefer to make it work correctly.

    Kai> I don't understand.  What does `W M c' have to do with
    Kai> latinX-disp.el?

Nothing.  There seem to be two "bugs" here:

1- Gnus seems to define its own buffer-display-table instead of
using the standard-display-table.  Since latinX-disp.el takes effect
simply by modifying the standard-display-table, it doesn't seem to work
in Gnus (at least in v5.9.0, which I use here).

2- I isolated the other bug (the \201 or \216 character added before every
character with the high bit set, in a multibyte environment) to a bug in
the coding system.  To test it, just evaluate this:

(decode-coding-string "élève" 'iso-8859-1) => "\216él\216ève" in latin-1 environment.

You can replace the coding system with anything, even no-conversion or
raw-text; it will always add the extra character, whatever your
environment.

Since decode-coding-string is a built-in function, I'm currently rebuilding
Emacs to debug this a little more.

    Kai> IMHO, it would be better to enable Emacs to just display many
    Kai> characters, rather than making changes in Gnus only.

Problem 1) is in Gnus, and I don't know whether it exists in CVS.  It will
certainly not be easy to fix, since it must be somewhere in the encoding
system of Gnus.  Maybe someone can help me with this one.

For the other one, it's a problem in my Emacs version... I'll continue to
investigate.

Ciao!
Fabien
- -- 
Fabien Niñoles                              Debian Maintainer
fabien@debian.org                       http://www.debian.org
GPG KeyID: C15D FE9E BB35 F596 127F  BF7D 8F1F DFC9 BCE0 9436
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.6 <http://mailcrypt.sourceforge.net/>

iD8DBQE9NyiIjx/fybzglDYRAiJhAJ4sAOC7/oFtMmGSTGH+qSLSFIvnxwCeJ78n
2abtUFTl8TPntykE0uRvH0g=
=OYfA
-----END PGP SIGNATURE-----


* Re: locale ISO-8859-15, UTF-8 and mail...
  2002-07-18 20:44       ` Fabien Niñoles
@ 2002-07-18 20:56         ` Jorgen Schaefer
  2002-07-19  7:19         ` Kai Großjohann
  1 sibling, 0 replies; 6+ messages in thread
From: Jorgen Schaefer @ 2002-07-18 20:56 UTC


fabien@tzone.org (Fabien Niñoles) writes:

>>> Something like "élève" comes out as "?l?ve", and if I use "C-u
>>> W M c latin-9", it becomes "\201él\201ève".  Plain "W M c"
>>> only adds \201 before each `?'.  I'm thinking of adding a
>>> command that simply calls (gnus-article-decode-charset nil
>>> 'latin-9) and removes all the \201 characters, but frankly I
>>> would prefer to make it work correctly.

I don't know whether this helps, but I have a default setup of
latin-1, and have the same problem viewing latin-9 messages.

In Message-ID: <0acc5b761e7260decdcdbcb5869bcce6@fitug.de>, both
W M c and C-u W M c latin-1 convert "n?herungsweise" to
"n\216äherungsweise".

> 2- I isolated the other bug (the \201 or \216 character added
> before every character with the high bit set, in a multibyte
> environment) to a bug in the coding system.  To test it, just
> evaluate this:
> 
> (decode-coding-string "élève" 'iso-8859-1) => "\216él\216ève" in latin-1 environment.

This gives "\201él\201ève" here, on an emacs 21.2.1 right from
Debian unstable.

HTH,
        -- Jorgen

-- 
((email . "forcer@forcix.cx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))




* Re: locale ISO-8859-15, UTF-8 and mail...
  2002-07-18 20:44       ` Fabien Niñoles
  2002-07-18 20:56         ` Jorgen Schaefer
@ 2002-07-19  7:19         ` Kai Großjohann
  1 sibling, 0 replies; 6+ messages in thread
From: Kai Großjohann @ 2002-07-19  7:19 UTC
  Cc: ding

fabien@tzone.org (Fabien Niñoles) writes:

>>>>>> "Kai" == Kai Großjohann <Kai.Grossjohann@CS.Uni-Dortmund.DE> writes:
>
>     Kai> I don't understand.  What does `W M c' have to do with
>     Kai> latinX-disp.el?
>
> Nothing.  There seem to be two "bugs" here:
>
> 1- Gnus seems to define its own buffer-display-table instead of
> using the standard-display-table.  Since latinX-disp.el takes effect
> simply by modifying the standard-display-table, it doesn't seem
> to work in Gnus (at least in v5.9.0, which I use here).

What does Gnus use that display table for?  Maybe it's used for
something useful.  But maybe Gnus should inherit from the standard
display table, and then frobbing that first with latinX-disp should
do the trick.
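
A minimal sketch of what "inheriting" could look like, assuming the
buffer-local table is simply seeded with a copy of the standard one;
`my-derived-display-table' is a hypothetical name, and this is not how
Gnus itself sets up its display table:

```elisp
(defun my-derived-display-table ()
  "Return a copy of `standard-display-table', or a fresh table if it is nil."
  (if standard-display-table
      ;; `copy-sequence' accepts char-tables, so entries already set
      ;; (for example by latinX-disp.el) carry over to the copy.
      (copy-sequence standard-display-table)
    (make-display-table)))

;; In the buffer that needs its own tweaks, one could then evaluate:
;; (setq buffer-display-table (my-derived-display-table))
```

The copy is taken at the moment of the call, so any later changes to
the standard display table would not be picked up automatically.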

> 2- I isolated the other bug (the \201 or \216 character added before
> every character with the high bit set, in a multibyte environment) to
> a bug in the coding system.  To test it, just evaluate this:
>
> (decode-coding-string "élève" 'iso-8859-1) => "\216él\216ève" in latin-1 environment.

The documentation for decode-coding-string says that it is required
for the string to be encoded in the given coding system.  Maybe your
string is in iso-8859-15 and not iso-8859-1.  If you give
decode-coding-string some Japanese and tell it to decode as
iso-8859-1, it won't be surprising that you get garbage, either.

\216\351 is the internal Emacs representation (the Mule encoding) of
that accented character in Latin-9.  \201\351 would be the internal
representation of the same character in Latin-1.

You can type C-u C-x = on the é character to verify whether it is in
iso-8859-1.

Hm.

Hm.

Also, I think that the function has been misapplied.
decode-coding-string and encode-coding-string are for converting
between external and internal representations.  So,
(decode-coding-string X 'iso-8859-1) takes a sequence X of bytes
encoded as iso-8859-1 and returns a string encoded in Emacs' internal
encoding.  But you were passing not a sequence of bytes encoded in
iso-8859-1, you were passing a Lisp string encoded in Emacs' internal
encoding.  And that encoding happens to use \216\351 for the é
character, so that's what you get...

Maybe you can use vector to create a proper sequence of bytes.
Or you get the bytes from an I/O operation (input, to be specific).
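
The distinction between raw bytes and a Lisp string can be sketched as
follows.  This assumes a Unicode-based Emacs (23 or later), where
`unibyte-string' is available, so it will not run as-is on Emacs 21;
`my-decode-latin1-bytes' is a hypothetical helper name:

```elisp
(defun my-decode-latin1-bytes (&rest bytes)
  "Decode BYTES (integers from 0 to 255) as ISO-8859-1 into a Lisp string."
  ;; `unibyte-string' builds the external byte sequence;
  ;; `decode-coding-string' converts it to the internal representation.
  (decode-coding-string (apply #'unibyte-string bytes) 'iso-8859-1))

;; The ISO-8859-1 bytes for "élève":
(my-decode-latin1-bytes #xE9 ?l #xE8 ?v ?e)
;; => "élève"
```

Going the other way, encode-coding-string turns the resulting string
back into the same five bytes.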

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)
