Gnus: UTF-8 and compatibility with other MUAs

Gnus development mailing list

* Gnus: UTF-8 and compatibility with other MUAs
@ 2003-08-14 15:48 Xavier Maillard
  2003-08-14 22:39 ` Frank Schmitt
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Xavier Maillard @ 2003-08-14 15:48 UTC (permalink / raw)



Hi,

I know Emacs is able to use the utf-8 encoding, so Gnus is too.

My question is more a question of compatibility with other MUAs.
Would you recommend that users adopt utf-8 as their default encoding
system? AFAIK, not many MUAs are aware of it and, worse, almost
nobody is using utf-8, which was presented as the future. So what is the
problem with utf-8 in general that prevents users from using it
by default?

Regards,

zeDek
-- 
alt.mcdonalds                     Can I get fries with that?


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 15:48 Gnus: UTF-8 and compatibility with other MUAs Xavier Maillard
@ 2003-08-14 22:39 ` Frank Schmitt
  2003-08-15 18:22   ` Xavier Maillard
  2003-08-14 23:01 ` Jesper Harder
  2003-08-14 23:05 ` Simon Josefsson
  2 siblings, 1 reply; 37+ messages in thread
From: Frank Schmitt @ 2003-08-14 22:39 UTC (permalink / raw)


Xavier Maillard <zedek@gnu-rox.org> writes:

> My question is more a question of compliance with other MUAs.
> Would you recommend your users to use utf-8 as a default encoding
> system ? AFAIK, I can't see many MUAs aware of it and worst almost
> nobody is using utf-8 which was presented as the future. So what is the
> problem with utf in general that prevent users in general to use it
> defaultly ?

Well, it's the chicken-egg-problem. People don't use UTF-8 since quite
some MUAs don't support it and some authors of MUAs don't add support
since few people use it.

Nevertheless I've got the impression that today most MUAs will handle
Unicode quite well. I send UTF-8 in both Mail and News and few people
told me they couldn't read my messages properly.

-- 
Did you ever realize how much text fits in eighty columns? If you now consider
that a signature usually consists of up to four lines, this gives you enough
space to spread a tremendous amount of information with your messages. So seize
this opportunity and don't waste your signature with bullshit nobody will read.




* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 15:48 Gnus: UTF-8 and compatibility with other MUAs Xavier Maillard
  2003-08-14 22:39 ` Frank Schmitt
@ 2003-08-14 23:01 ` Jesper Harder
  2003-08-15 13:50   ` Oliver Scholz
  2003-08-15 18:24   ` Xavier Maillard
  2003-08-14 23:05 ` Simon Josefsson
  2 siblings, 2 replies; 37+ messages in thread
From: Jesper Harder @ 2003-08-14 23:01 UTC (permalink / raw)

Xavier Maillard <zedek@gnu-rox.org> writes:

> I know Emacs is able to use utf-8 encoding so Gnus is.
>
> My question is more a question of compliance with other MUAs.
> Would you recommend your users to use utf-8 as a default encoding
> system ?

No, because there's no reason to use UTF-8 if a more widely supported
charset is sufficient.

To use UTF-8 by default would also be against RFC 2046:

,----[ RFC 2046, Section 4.1.2. ]
|
|    In general, composition software should always use the "lowest common
|    denominator" character set possible.  For example, if a body contains
|    only US-ASCII characters, it SHOULD be marked as being in the US-
|    ASCII character set, not ISO-8859-1, which, like all the ISO-8859
|    family of character sets, is a superset of US-ASCII.  More generally,
|    if a widely-used character set is a subset of another character set,
|    and a body contains only characters in the widely-used subset, it
|    should be labelled as being in that subset.  This will increase the
|    chances that the recipient will be able to view the resulting entity
|    correctly.
`----

But if the message contains characters (or combination of characters)
where a _single_ iso-8859-x charset can't be used, then by all means
use UTF-8.  This is far better than sending a multipart message (which
Gnus does if UTF-8 isn't available).


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 15:48 Gnus: UTF-8 and compatibility with other MUAs Xavier Maillard
  2003-08-14 22:39 ` Frank Schmitt
  2003-08-14 23:01 ` Jesper Harder
@ 2003-08-14 23:05 ` Simon Josefsson
  2003-08-15 17:00   ` Oliver Scholz
  2 siblings, 1 reply; 37+ messages in thread
From: Simon Josefsson @ 2003-08-14 23:05 UTC (permalink / raw)
  Cc: ding

Xavier Maillard <zedek@gnu-rox.org> writes:

> Hi,
>
> I know Emacs is able to use utf-8 encoding so Gnus is.
>
> My question is more a question of compliance with other MUAs.
> Would you recommend your users to use utf-8 as a default encoding
> system ? AFAIK, I can't see many MUAs aware of it and worst almost
> nobody is using utf-8 which was presented as the future. So what is the
> problem with utf in general that prevent users in general to use it
> defaultly ?

IMHO:

Users should use the oldest charset widely deployed, or preferred, in
their own geographic region that is able to encode what they write.

This means if a user writes only ASCII, it is tagged as ASCII (or
rather not tagged at all).

And if a (northern?) European user writes å they should use iso-8859-1.

And if a European user writes Ελληνικά they should use iso-8859-7.

And if a European user writes € they should use iso-8859-15.  (One could
argue that iso-8859-15 is too recent and that it may make sense to go
directly to UTF-8, but my experience, as a northern European user, is
that iso-8859-15 is more appropriate, since the almost-compatibility
with iso-8859-1 is friendlier for people with old software.)

And if a European user writes € and ά they should use UTF-8.  (I'm
assuming no 8859-* can encode both € and ά.)

This also means that it is wrong for European users to use
ISO-2022-JP-2, even though it technically may be able to encode some
strings that contain characters from several 8859-* sets and so aren't
available in any single 8859-*.  Instead they should go to UTF-8.

I think this is how Gnus works though, unless you are in a UTF-8
locale and use an old Emacs (then I think it will skip the 8859-*
step, but I might be wrong).

This logic might be flawed if the receiver is in another geographic
region, or if a user mostly communicates internationally.  Still, I'd
probably use the above logic even if I sent something to a Japanese
user, and expect them to use ISO-2022-JP-2 (or whatever) in return.

Perhaps some day we can try ASCII first, then fall back to UTF-8.  But
that will take a long time.  Even moving to ISO-8859-1 in northern
Europe took a long time, and still isn't finished.  I still use IBMPC2
(CP437?) in some regional communication channels.
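
The preference ladder described above can be sketched in a few lines of Python. This is only an illustration, not how Gnus actually picks a charset: the function name `pick_charset` and the candidate list with its ordering are assumptions made for the example.

```python
# Candidate charsets in order of preference. The list and its order are an
# assumption for illustration, loosely following the ladder described above.
CANDIDATES = ["us-ascii", "iso-8859-1", "iso-8859-15", "iso-8859-7", "utf-8"]


def pick_charset(text, candidates=CANDIDATES):
    """Return the first candidate charset that can encode every character."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            # This charset lacks at least one character; try the next one.
            continue
        return charset
    raise ValueError("no candidate charset can encode this text")


if __name__ == "__main__":
    for sample in ["plain text", "å", "€", "Ελληνικά", "å and Ελληνικά"]:
        print(repr(sample), "->", pick_charset(sample))
```

Which charset a given string lands in depends on the codec tables shipped with the Python version in use, so the outcome for borderline characters should be checked rather than assumed.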


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 23:01 ` Jesper Harder
@ 2003-08-15 13:50   ` Oliver Scholz
  2003-08-15 16:48     ` Jesper Harder
  2003-08-15 18:24   ` Xavier Maillard
  1 sibling, 1 reply; 37+ messages in thread
From: Oliver Scholz @ 2003-08-15 13:50 UTC (permalink / raw)

Jesper Harder <harder@myrealbox.com> writes:
[...]
> To use UTF-8 by default would also be against RFC 2046:
>
> ,----[ RFC 2046, Section 4.1.2. ]
> |
> |    In general, composition software should always use the "lowest common
> |    denominator" character set possible.  For example, if a body contains
> |    only US-ASCII characters, it SHOULD be marked as being in the US-
> |    ASCII character set, not ISO-8859-1, which, like all the ISO-8859
> |    family of character sets, is a superset of US-ASCII.  More generally,
> |    if a widely-used character set is a subset of another character set,
> |    and a body contains only characters in the widely-used subset, it
> |    should be labelled as being in that subset.  This will increase the
> |    chances that the recipient will be able to view the resulting entity
> |    correctly.
> `----
[...]

That's not how I read the section you quoted. In my reading this
means that you should not declare the message to be in UTF-8, when it
contains only ASCII characters. For characters from the right hand
part of ISO 8859-1 this is not so simple: Latin-1 (as a coded
character set) may be a subset of UCS. But Latin-1 (as a character
encoding scheme) is _not_ a subset of UTF-8.

The lowest common denominator for most German text is ISO 646-DE. For
most Danish text (I presume) ISO 646-DK. Virtually nobody uses those
coding systems anymore, and IMNSHO nobody should use them. (I have
implemented ISO 646-DE for GNU Emacs in a way that it could be easily
extended to other national variants of ISO 646, in case you are
interested ...)

Sure, one could say that the national variants of ISO 646 are excluded
by the phrase “widely-used character sets”, but that is a bit too
fuzzy for my taste. Taken literally nobody should use ISO 8859-15
then, unless the message really contains an € (or one of the other 7
characters). Maybe this is what this section wants to say, but then I
dare say that it doesn't make much sense as a technical rule and I am
glad that it is not stated in a way that makes it mandatory.
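
The "€ or one of the other 7 characters" count can be checked by comparing how the two charsets decode each octet. A minimal sketch, assuming Python's built-in `iso-8859-1` and `iso-8859-15` codecs; the helper name `differing_bytes` is made up for the example.

```python
def differing_bytes(codec_a="iso-8859-1", codec_b="iso-8859-15"):
    """Map each octet that decodes differently to its pair of characters."""
    diff = {}
    # Only the upper half (0xA0-0xFF) holds printable non-ASCII characters.
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        char_a = raw.decode(codec_a)
        char_b = raw.decode(codec_b)
        if char_a != char_b:
            diff[value] = (char_a, char_b)
    return diff


if __name__ == "__main__":
    for value, (latin1, latin9) in sorted(differing_bytes().items()):
        print(f"0x{value:02X}: Latin-1 {latin1!r}  Latin-9 {latin9!r}")
```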

    Oliver
-- 
28 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 13:50   ` Oliver Scholz
@ 2003-08-15 16:48     ` Jesper Harder
  2003-08-15 18:10       ` Oliver Scholz
  0 siblings, 1 reply; 37+ messages in thread
From: Jesper Harder @ 2003-08-15 16:48 UTC (permalink / raw)

Oliver Scholz <alkibiades@gmx.de> writes:

> The lowest common denominator for most German text is ISO
> 646-DE. For most Danish text (I presume) ISO 646-DK. Virtually
> nobody uses those coding systems anymore, and IMNSHO nobody should
> use them.

The RFC does say that ISO-8859 is preferred over ISO 646:

   Note that the ISO 646 character sets have deliberately been omitted
   in favor of their 8859 replacements, which are the designated
   character sets for Internet mail.

> Taken literally nobody should use ISO 8859-15 then, unless the
> message really contains an € (or one of the other 7
> characters). 

I agree with that.  I don't see _any_ reason to use latin-9 if you
don't need it.  Some MUAs don't support latin-9 (including older
versions of Gnus) -- why break those clients for no good reason?


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 23:05 ` Simon Josefsson
@ 2003-08-15 17:00   ` Oliver Scholz
  2003-08-16  7:43     ` Ivan Boldyrev
  2003-08-18  6:01     ` Steinar Bang
  0 siblings, 2 replies; 37+ messages in thread
From: Oliver Scholz @ 2003-08-15 17:00 UTC (permalink / raw)

Simon Josefsson <jas@extundo.com> writes:

> Xavier Maillard <zedek@gnu-rox.org> writes:
[...]
>> My question is more a question of compliance with other MUAs.
>> Would you recommend your users to use utf-8 as a default encoding
>> system ? AFAIK, I can't see many MUAs aware of it and worst almost
>> nobody is using utf-8 which was presented as the future. So what is the
>> problem with utf in general that prevent users in general to use it
>> defaultly ?

I have been using UTF-8 as a default in Mails&News for over a year
now. It is sometimes problematic, but even if the MUA on the other end
does not cope with UTF-8 it never makes my (western european) text
entirely unreadable.

Sure, that is still not nice. But like Frank I see it as a
chicken-and-egg problem: I decided once that I was going to promote
UTF-8 by using it. I realized then that virtually none of my
non-technical-oriented friends had any problems with UTF-8, since
they use programs like Outlook, Mozilla or some obscure Macintosh
MUA, whose name I have forgotten. The only major group of people who
have problems with UTF-8 are computer-literates. This seems weird to
me. I wouldn't use UTF-8 if it were the other way around.

I don't think that this sort of UTF-8 radicalism is the right thing
for everyone. Simon's suggestions demonstrate nicely the
tower-of-babel situation resulting from the current flood of coding
systems, but I have to admit that they also indicate the most sensible
way to deal with those things, if you want to maximize the chance that
your text is flawlessly readable at the other end.

But I do think that *some* people should start to use UTF-8 as a
default.

[...]
> And if a european user write € it should use iso-8859-15.  (One could
> argue that iso-8859-15 is too recent and that it may make sense to go
> directly to UTF-8, but my experience, as a northern european user, is
> that iso-8859-15 is more appropriate, since the almost-compatibility
> with iso-8859-1 is friendlier for people with old software.)

This seems to make sense. But how well does it work in your
experience? It seems possible to me that a non-Latin-9-aware MUA/NUA
could try to display a message with an iso-8859-15 charset header as
ASCII, so that not even Latin-1-compatible chars would be displayed
correctly. Does that happen with some MUAs?

However, I tend to think that it's better to write EUR instead of €,
if you want to avoid UTF-8.

[...]
> Perhaps some day we can try ASCII first, then fall back to UTF-8.  But
> that will take a long time.  Even moving to ISO-8859-1 in northern
> Europe took a long time, and still isn't finished.  I still use IBMPC2
> (CP437?) in some regional communication channels.
[...]

I think it is in general a good idea to choose the encoding according
to the audience. Fortunately this is not hard with Gnus. There are
some people to which I send my mail in Latin-1.

    Oliver
-- 
28 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 16:48     ` Jesper Harder
@ 2003-08-15 18:10       ` Oliver Scholz
  2003-08-16  0:23         ` Jesper Harder
  0 siblings, 1 reply; 37+ messages in thread
From: Oliver Scholz @ 2003-08-15 18:10 UTC (permalink / raw)

Jesper Harder <harder@myrealbox.com> writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> The lowest common denominator for most German text is ISO
>> 646-DE. For most Danish text (I presume) ISO 646-DK. Virtually
>> nobody uses those coding systems anymore, and IMNSHO nobody should
>> use them.
>
> The RFC does say that ISO-8859 is preferred over ISO 646:
>
>    Note that the ISO 646 character sets have deliberately been omitted
>    in favor of their 8859 replacements, which are the designated
>    character sets for Internet mail.
>

Hmm. I guess it's time for me to finally read RFC 2046 ...

>> Taken literally nobody should use ISO 8859-15 then, unless the
>> message really contains an € (or one of the other 7
>> characters). 
>
> I agree with that.  I don't see _any_ reason to use latin-9 if you
> don't need it.  Some MUA's don't support latin-9 (including older
> versions of Gnus) -- why break those clients for no good reason?

Well, I think, if you want to maximize the chance that your message
is flawlessly readable at the other end, this makes sense as a
pragmatic rule.

As a technical rule, however, which is important for the question
whether a message is fully RFC compliant or not, it does not make
sense.

BTW, if the rule were that we should use the smallest, most widely
used coded character set which covers all the necessary characters in
a message, then western European users should use neither Latin-1 nor
Latin-9, but windows-1252.

However, from the section you quoted alone it is not entirely clear
whether it refers to abstract characters, code points in a coded
character set or octets in a character encoding scheme. The term
“character set” may seem to indicate that they are talking about coded
character sets, but RFC 2046 refers to RFC 2045 for the definition of
the term “character set”. There it reads:

   NOTE: The term "character set" was originally to describe such
   straightforward schemes as US-ASCII and ISO-8859-1 which have a
   simple one-to-one mapping from single octets to single characters.
   Multi-octet coded character sets and switching techniques make the
   situation more complex. For example, some communities use the term
   "character encoding" for what MIME calls a "character set", while
   using the phrase "coded character set" to denote an abstract mapping
   from integers (not octets) to characters.

So I'd say “character set” refers to the character encoding
scheme. And in this sense the rule makes sense: if a message contains
only characters from the ASCII repertoire it should be declared as
US-ASCII, not as UTF-8. But that does not extend to ISO
8859-[[:digit:]]+, since UTF-8 and Latin-1 are not compatible.
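
The difference between "subset as a character repertoire" and "subset as an encoding scheme" shows up at the octet level. A small sketch, assuming Python's built-in codecs; the helper name `same_octets` is invented for the example.

```python
def same_octets(text, narrow, wide="utf-8"):
    """True if `text` encodes to identical octets in both charsets."""
    return text.encode(narrow) == text.encode(wide)


if __name__ == "__main__":
    # ASCII text is octet-for-octet identical in US-ASCII and UTF-8.
    print(same_octets("Gnus", "us-ascii"))
    # Latin-1 text is not: ü is one octet in Latin-1 and two in UTF-8.
    print(same_octets("Grüße", "iso-8859-1"))
    print("ü".encode("iso-8859-1"), "ü".encode("utf-8"))
```

So a US-ASCII body can be relabelled as UTF-8 without changing a single octet, while a Latin-1 body has to be re-encoded.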

    Oliver
-- 
28 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 22:39 ` Frank Schmitt
@ 2003-08-15 18:22   ` Xavier Maillard
  0 siblings, 0 replies; 37+ messages in thread
From: Xavier Maillard @ 2003-08-15 18:22 UTC (permalink / raw)



Frank Schmitt <usereplyto@frank-schmitt.net> writes:

>  Xavier Maillard <zedek@gnu-rox.org> writes:
>  
> >  My question is more a question of compliance with other MUAs.
> >  Would you recommend your users to use utf-8 as a default encoding
> >  system ? AFAIK, I can't see many MUAs aware of it and worst almost
> >  nobody is using utf-8 which was presented as the future. So what is
> >  the problem with utf in general that prevent users in general to
> >  use it defaultly ?
>  
>  Well, it's the chicken-egg-problem. People don't use UTF-8 since
>  quite some MUAs don't support it and some authors of MUAs don't add
>  support since few people use it.

Yep I have the same impression right now.
 
>  Nevertheless I've got the impression that today most MUAs will handle
>  Unicode quite well. I send UTF-8 in both Mail and News and few people
>  told me they couldn't read my messages properly.

They stay readable, but it seems to be a pain to read a mail containing
accented characters.

zeDek
-- 
"Die Geteilten selbst sind jedoch nie Feindbild der Vereiner, denn dies
 sind immer nur die Teiler." 
		Norbert Harry Marzahn <70oKlG6LbXB@nm01.vision.IN-BRB.DE>



* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-14 23:01 ` Jesper Harder
  2003-08-15 13:50   ` Oliver Scholz
@ 2003-08-15 18:24   ` Xavier Maillard
  2003-08-16  0:35     ` Jesper Harder
  1 sibling, 1 reply; 37+ messages in thread
From: Xavier Maillard @ 2003-08-15 18:24 UTC (permalink / raw)



Jesper Harder <harder@myrealbox.com> writes:

>  Xavier Maillard <zedek@gnu-rox.org> writes:
>  
> >  I know Emacs is able to use utf-8 encoding so Gnus is.
> >  
> >  My question is more a question of compliance with other MUAs.
> >  Would you recommend your users to use utf-8 as a default encoding
> >  system ?
>  
>  No, because there's no reason to use UTF-8 if a more widely supported
>  charset is sufficient.

Ok for that. So what would be the default charset to recommend to
people ? 

So why the hell was utf-8 invented?
  
>  To use UTF-8 by default would also be against RFC 2046:
>  
>  ,----[ RFC 2046, Section 4.1.2. ]
>  |
>  |    In general, composition software should always use the "lowest
>  |    common denominator" character set possible.  For example, if a
>  |    body contains only US-ASCII characters, it SHOULD be marked as
>  |    being in the US- ASCII character set, not ISO-8859-1, which,
>  |    like all the ISO-8859 family of character sets, is a superset of
>  |    US-ASCII.  More generally, if a widely-used character set is a
>  |    subset of another character set, and a body contains only
>  |    characters in the widely-used subset, it should be labelled as
>  |    being in that subset.  This will increase the chances that the
>  |    recipient will be able to view the resulting entity correctly.
>  `----  
>  But if the message contains characters (or combination of characters)
>  where a _single_ iso-8859-x charset can't be used, then by all means
>  use UTF-8.  This is far better than sending a multipart message
>  (which Gnus does if UTF-8 isn't available).

Thanx for the hint.

zeDek
-- 
"Just did it."


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 18:10       ` Oliver Scholz
@ 2003-08-16  0:23         ` Jesper Harder
  2003-08-16  9:48           ` Oliver Scholz
  0 siblings, 1 reply; 37+ messages in thread
From: Jesper Harder @ 2003-08-16  0:23 UTC (permalink / raw)

Oliver Scholz <alkibiades@gmx.de> writes:

> Well, I think, if you want to maximize the chance that your message
> is flawlessly readable at the other end

That _is_ the raison d'être for MIME after all.

> As a technical rule, however, which is important for the question
> whether a message is fully RFC compliant or not, it does not make
> sense.

To be fair, the RFC does recognize that they weren't able to specify
exact rules at the time:

   The character sets specified above are the ones that were relatively
   uncontroversial during the drafting of MIME.  This document does not
   endorse the use of any particular character set other than US-ASCII,
   and recognizes that the future evolution of world character sets
   remains unclear.

> BTW, if the rule were that we should use the smallest, most widely
> used coded character set which covers all the necessary characters in
> a message, then western European users should use neither Latin-1 nor
> Latin-9, but windows-1252.

No, because Windows-1252 isn't a standard, i.e. endorsed by IETF, ISO
or another reputable standards body.  (IANA registration doesn't make
it a standard -- anyone can in principle register any old homebrewed
charset with IANA).


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 18:24   ` Xavier Maillard
@ 2003-08-16  0:35     ` Jesper Harder
  0 siblings, 0 replies; 37+ messages in thread
From: Jesper Harder @ 2003-08-16  0:35 UTC (permalink / raw)

Xavier Maillard <zedek@gnu-rox.org> writes:

> Jesper Harder <harder@myrealbox.com> writes:
>
>>  Xavier Maillard <zedek@gnu-rox.org> writes:
>>  
>> >  Would you recommend your users to use utf-8 as a default
>> >  encoding system ?
>>  
>>  No, because there's no reason to use UTF-8 if a more widely supported
>>  charset is sufficient.
>
> Ok for that. So what would be the default charset to recommend to
> people ? 

I would just leave the default setting in Gnus as it is.

> Why the hell was utf-8 invented so ?

To extend the repertoire of available glyphs and allow people to mix
glyphs from different scripts.

UTF-8 is excellent if you need it.  But most people usually don't need
to mix Vietnamese and Thai words, write hieroglyphics or runes and so
on.


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 17:00   ` Oliver Scholz
@ 2003-08-16  7:43     ` Ivan Boldyrev
  2003-08-17 17:27       ` Oliver Scholz
  2003-08-18  6:01     ` Steinar Bang
  1 sibling, 1 reply; 37+ messages in thread
From: Ivan Boldyrev @ 2003-08-16  7:43 UTC (permalink / raw)


On 8472 day of my life Oliver Scholz wrote:
>> Perhaps some day we can try ASCII first, then fall back to UTF-8.  But
>> that will take a long time.  Even moving to ISO-8859-1 in northern
>> Europe took a long time, and still isn't finished.  I still use IBMPC2
>> (CP437?) in some regional communication channels.
> [...]
>
> I think it is in general a good idea to choose the encoding according
> to the audience. Fortunately this is not hard with Gnus. There are
> some people to which I send my mail in Latin-1.

Do you use a special group for them or do something more tricky?

-- 
Ivan Boldyrev

                                Onions has layers.  Unix has layers too.





* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16  0:23         ` Jesper Harder
@ 2003-08-16  9:48           ` Oliver Scholz
  2003-08-16 13:01             ` Jesper Harder
  0 siblings, 1 reply; 37+ messages in thread
From: Oliver Scholz @ 2003-08-16  9:48 UTC (permalink / raw)

Jesper Harder <harder@myrealbox.com> writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> Well, I think, if you want to maximize the chance that your message
>> is flawlessly readable at the other end
>
> That _is_ the raison d'être for MIME after all.

Yes, but I think we'd both agree that the chance is rather small, if I
started to use the MIME compliant coding system ISO 2022 in western
Europe.

IMO Unicode offers the chance to escape the current tower-of-babel
situation as far as character encodings are concerned. I'd like to
compare the current state of affairs with western Europe in
pre-Latin-1 time. I'd like to put it this way:

If you are satisfied with a _fair_ chance to be flawlessly readable
at the other end, you may use UTF-8.

If you want to _maximize_ the chance that you are flawlessly readable
at the other end, but don't want to sacrifice important national
characters, you should follow the rules which Simon pointed out.

If you want to be _sure_ that you are flawlessly readable at the
other end, you should use US-ASCII. In Germany there are at least two
conventions to express umlauts in plain ASCII. I'd guess that similar
conventions exist for other languages.

How long it will take for Unicode to become as widespread in western
Europe as Latin-1 is now -- I don't know. But so far it has spread
very rapidly.

[...]
>> BTW, if the rule were that we should use the smallest, most widely
>> used coded character set which covers all the necessary characters in
>> a message, then western European users should use neither Latin-1 nor
>> Latin-9, but windows-1252.
>
> No, because Windows-1252 isn't a standard, i.e. endorsed by IETF, ISO
> or another reputable standards body.  (IANA registration doesn't make
> it a standard -- anyone can in principle register any old homebrewed
> charset with IANA).

[Aside: Hmm, maybe it could be funny to register emacs-mule ...]

I also prefer standards developed by official standards bodies,
especially such like ISO, CEN (Europe) and DIN (Germany), because they
are at least indirectly under democratic control.

However, there are also things that are de facto standards, because of
their widespread use. Windows-1252, Postscript and the English
language, for example. I am pretty sure that there are more people
around whose MUAs/NUAs can deal with windows-1252 than with Latin-9.

    Oliver
-- 
29 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16  9:48           ` Oliver Scholz
@ 2003-08-16 13:01             ` Jesper Harder
  2003-08-16 15:36               ` Oliver Scholz
  0 siblings, 1 reply; 37+ messages in thread
From: Jesper Harder @ 2003-08-16 13:01 UTC (permalink / raw)

Oliver Scholz <alkibiades@gmx.de> writes:

> If you are satisfied with a _fair_ chance to be flawlessly readable
> at the other end, you may use UTF-8.

But the purpose of email is to _communicate_.  Why lower your chance of
communicating if there is no compelling technical reason to do so?

> How long it will take for Unicode to become as widespread in western
> Europe as Latin-1 is now -- I don't know. But so far it has spread
> very rapidly.

1. Application support isn't that great.  Emacs, (La)TeX and Texinfo
   don't support Unicode fully (those are some of the most important
   applications as far as I'm concerned).

2. Unicode support itself doesn't really buy me a lot if most people
   don't have fairly complete Unicode fonts (which they don't).


* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 13:01             ` Jesper Harder
@ 2003-08-16 15:36               ` Oliver Scholz
  2003-08-16 17:14                 ` Reiner Steib
                                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Oliver Scholz @ 2003-08-16 15:36 UTC (permalink / raw)

Jesper Harder <harder@myrealbox.com> writes:

> Oliver Scholz <alkibiades@gmx.de> writes:
>
>> If you are satisfied with a _fair_ chance to be flawlessly readable
>> at the other end, you may use UTF-8.
>
> But the purpose of email is to _communicate_.  Why lower your chance of
> communicating if there is no compelling technical reason to do so?

First of all: I am not talking about UTF-16 or UTF-7, and I am not
talking about Greek, Hebrew or Arabic. I am talking about UTF-8 for
Latin-based scripts. Even if there is no UTF-8 support at all at the
other end, communication won't fail. As things stand I would not yet
recommend UTF-8 to a Greek user, for example. Now and then I notice
on German Usenet that a few people who post replies to my articles
cannot deal with UTF-8, because when they quote the text I wrote, I
see funny characters instead of umlauts. This is not a big impediment
to communication. I doubt that anybody would put me into his or her
killfile because I use UTF-8.
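
The "funny characters instead of umlauts" effect can be reproduced by decoding UTF-8 octets as if they were Latin-1. A minimal sketch, assuming a reader that treats every message as ISO 8859-1; the helper name `misread_as_latin1` is made up for the example.

```python
def misread_as_latin1(text):
    """Show what a Latin-1-only reader displays for UTF-8 encoded text."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    # Each umlaut turns into two characters; the ASCII letters survive,
    # which is why the text stays legible.
    print(misread_as_latin1("Müller"))
```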

And, yes, there is a technical reason that Unicode should become the
default text encoding in the future. The fact that we have a myriad of
different encodings to choose from causes a lot of trouble; just
consider how many questions there are in the various Emacs newsgroups
about coding system issues; and this is just the tip of the
iceberg. Sure, Unicode sometimes causes trouble, too. But at least one
could say that these are problems of transition. If we don't move to
Unicode in the future then coding system problems will go on forever
and ever.

If we stick to 256-character encodings forever, then Latin-9 won't be
the last invention that we will have seen. There may be a need for a
new character in three, five, seven years. Who knows? Latin-10 is
already in its final state. What would save us from Latin-11, Latin-12
... Latin-N, if not a single unified encoding that is designed to
match any need now and in the future?

My guess -- by the way -- is that Unicode will become increasingly
important in Europe, especially for the members of the EU. We'd need
at least Latin-1/Latin-9, Latin-2 and Greek (ISO 8859-7). And I am not
sure if that already covers Latvian, Romanian and others. There will
be a growing need for an encoding that covers all of these languages.

Then, if you want to be absolutely sure that everything works as
expected, your only option is ASCII. Maybe Latin-1 is also
o.k. for a Western European. But every encoding that contains a Euro
sign is a big no-no.

I really hope for a future (however remote it may be), where I can be
sure that every text file I find on a computer is either ASCII, UTF-8
or UTF-16. When we look back then, we will regard this whole ISO
8859-soup as something as strange and weird as EBCDIC.

>> How long it will take for Unicode to become as widespread in western
>> Europe as Latin-1 is now -- I don't know. But so far it has spread
>> very rapidly.
>
> 1. Application support isn't that great.  Emacs, (La)TeX and Texinfo
>    don't support Unicode fully (those are some of the most important
>    applications as far as I'm concerned).

The Unicode support for Emacs is quite good; there may be issues with
CJK in the current released version of Emacs, but the rest works
fine. But yes, LaTeX and Texinfo (especially Texinfo) need
fixing. Even I, Unicode-Jacobite that I am, use Latin-1 for my LaTeX
stuff. But AFAIK there is some work going on, fortunately. The babel
encoding (sic!) for classical Greek (to take an example that is
important for me) is a nuisance. It is about time for LaTeX to support
Unicode.

> 2. Unicode support itself doesn't really buy me a lot if most people
>    don't have fairly complete Unicode fonts (which they don't).
[...]

So the worst thing that could happen is that they see a hollow box now
and then. And yet some characters are more frequent than others. You
can probably rely on the fact that western Europeans have fonts that
contain the Latin-1 repertoire. Box drawing characters or symbols may
not be that frequent, but there is a good chance to get the additional
punctuation characters.

In the future, when UTF-8 is the default in Mail and News, this
shouldn't be a problem anymore. People who read mailing lists about
classical Greek will make sure that they have a font containing
“Greek Extended”; the regulars of alt.fan.tolkien (whatever) will make
sure that they can display Tengwar, Star Trek fans will use fonts
including Klingon etc. etc.

    Oliver
-- 
29 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 15:36               ` Oliver Scholz
@ 2003-08-16 17:14                 ` Reiner Steib
  2003-08-16 19:29                   ` Oliver Scholz
  2003-08-19 14:54                   ` Miles Bader
  2003-08-16 17:23                 ` Simon Josefsson
  2003-08-17  0:57                 ` Jesper Harder
  2 siblings, 2 replies; 37+ messages in thread
From: Reiner Steib @ 2003-08-16 17:14 UTC (permalink / raw)


On Sat, Aug 16 2003, Oliver Scholz wrote:

> The Unicode support for Emacs is quite good; there may be issues
> with CJK in the current released version of Emacs, but the rest
> works fine.

Not only in the released versions, see this thread on emacs-devel:
<URL:http://article.gmane.org/gmane.emacs.devel/13487>.

> But yes, LaTeX and Texinfo (especially Texinfo) need fixing.

Texinfo until recently (I didn't find time to check 4.5 and 4.6)
didn't even support Latin-1 @documentencoding.  See
<URL:http://search.gmane.org/search.php?query=documentencoding&group=gmane.comp.tex.texinfo.general>

> Even I, Unicode-Jacobite that I am, use Latin-1 for my LaTeX
> stuff. But AFAIK there is some work going on, fortunately. [...] It
> is about time for LaTeX to support Unicode.

Maybe interesting for you (I didn't test it.  BTW: Did you ever try
Omega?):

,----
| From: Frank Mittelbach <frank.mittelbach@latex-project.org>
| Newsgroups: de.comp.text.tex
| Subject: ankuendigung: utf8 support fuer inputenc
| Date: Mon, 26 May 2003 21:06:17 +0200
| Message-ID: <batoro$5cj$2@online.de>
`----

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo--- PGP key available via WWW   http://rsteib.home.pages.de/




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 15:36               ` Oliver Scholz
  2003-08-16 17:14                 ` Reiner Steib
@ 2003-08-16 17:23                 ` Simon Josefsson
  2003-08-16 19:18                   ` Oliver Scholz
                                     ` (2 more replies)
  2003-08-17  0:57                 ` Jesper Harder
  2 siblings, 3 replies; 37+ messages in thread
From: Simon Josefsson @ 2003-08-16 17:23 UTC (permalink / raw)
  Cc: ding

Oliver Scholz <alkibiades@gmx.de> writes:

> In the future, when UTF-8 is the default in Mail and News, this
> shouldn't be a problem anymore. People who read mailing lists about
> classical Greek will make sure that they have a font containing
> “Greek Extended”; the regulars of alt.fan.tolkien (whatever) will make
> sure that they can display Tengwar, Star Trek fans will use fonts
> including Klingon etc. etc.

Wasn't the Klingon proposal for Unicode rejected?  Tengwar has been a
proposal for ten years, or so, and nothing has happened, as far as I
know.

> I really hope for a future (however remote it may be), where I can be
> sure that every text file I find on a computer is either ASCII, UTF-8
> or UTF-16.

UTF-16?  It's not even a well-defined encoding scheme: two files may
contain the exact same Unicode code points, but may differ in a binary
comparison, due to byte ordering.  And concatenating two UTF-16
strings from different sources requires knowledge about the encoding.
And surrogate pairs complicate matters as well.

> When we look back then, we will regard this whole ISO 8859-soup
> as something as strange and weird as EBCDIC.

I wish I could be that optimistic.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 17:23                 ` Simon Josefsson
@ 2003-08-16 19:18                   ` Oliver Scholz
  2003-08-16 22:24                     ` Simon Josefsson
  2003-08-18  2:09                   ` James H. Cloos Jr.
  2003-08-28 13:35                   ` Jens Müller
  2 siblings, 1 reply; 37+ messages in thread
From: Oliver Scholz @ 2003-08-16 19:18 UTC (permalink / raw)

Simon Josefsson <jas@extundo.com> writes:

> Oliver Scholz <alkibiades@gmx.de> writes:

[Klingon and Tengwar in Unicode]

> Wasn't the Klingon proposal for Unicode rejected?  Tengwar has been a
> proposal for ten years, or so, and nothing has happened, as far as I
> know.

I have no idea. I was just looking for exotic examples and these two
were the second and third ones that came to my mind.

[...]
> UTF-16?  It's not even a well-defined encoding scheme: two files may
> contain the exact same Unicode code points, but may differ in a binary
> comparison, due to byte ordering.  

That's what the byte order mark is for.

> And concatenating two UTF-16 strings from different sources requires
> knowledge about the encoding. And surrogate pairs complicate matters
> as well.

Why do you think that surrogate pairs complicate matters? There can't
be any confusion whether an arbitrary 16-bit value is part of a surrogate
pair or not; and if it is, whether it is the high surrogate or the
low one. As for concatenating I'd say this depends on whether the
tools are able to deal with it. But I do have to admit that I have
zero experience with UTF-16. I don't know how good it is in daily
use. I use only UTF-8. I mentioned UTF-16 only because I am told that
it is important in some areas (Java, MS Windows, XML ...).
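
The claim that a 16-bit value can always be classified on its own can
be sketched in Python. The function names are hypothetical and the
sample character is just an example from outside the BMP; the ranges
are the surrogate ranges reserved by Unicode:

```python
def classify_utf16_unit(unit):
    """Classify a single 16-bit UTF-16 code unit by its value alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP code unit"


def utf16_units(text):
    """Return the 16-bit code units of TEXT in UTF-16 (big-endian)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which needs a pair.
units = utf16_units("a\U0001D11E")
labels = [classify_utf16_unit(u) for u in units]
print([hex(u) for u in units], labels)
```

No look-ahead or look-behind is needed to tell the three cases apart,
because the surrogate ranges are never used for ordinary characters.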

    Oliver
-- 
29 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 17:14                 ` Reiner Steib
@ 2003-08-16 19:29                   ` Oliver Scholz
  2003-08-19 14:54                   ` Miles Bader
  1 sibling, 0 replies; 37+ messages in thread
From: Oliver Scholz @ 2003-08-16 19:29 UTC (permalink / raw)

Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:
[...]
> Maybe interesting for you (I didn't test it.  BTW: Did you ever try
> Omega?):
>
> ,----
> | From: Frank Mittelbach <frank.mittelbach@latex-project.org>
> | Newsgroups: de.comp.text.tex
> | Subject: ankuendigung: utf8 support fuer inputenc
> | Date: Mon, 26 May 2003 21:06:17 +0200
> | Message-ID: <batoro$5cj$2@online.de>
> `----
[...]

Thanks. It is good to know that this is going to be part of the main
distribution. I was not yet able to make it work, but I will try again
as soon as I have my GNU/Linux up and running again. I only hope that
it lets me write Greek text without a special switching command, which
was required by the previous effort from Dominique Unruh.

No, I have not yet tried Omega. I didn't understand the
documentation, or I didn't find the right documentation. Could
anybody point me to an introduction to Omega for LaTeX-dummies?

    Oliver
-- 
29 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 19:18                   ` Oliver Scholz
@ 2003-08-16 22:24                     ` Simon Josefsson
  2003-08-17 12:30                       ` Benjamin Riefenstahl
  2003-08-18  2:16                       ` James H. Cloos Jr.
  0 siblings, 2 replies; 37+ messages in thread
From: Simon Josefsson @ 2003-08-16 22:24 UTC (permalink / raw)
  Cc: ding

Oliver Scholz <alkibiades@gmx.de> writes:

> [...]
>> UTF-16?  It's not even a well-defined encoding scheme: two files may
>> contain the exact same Unicode code points, but may differ in a binary
>> comparison, due to byte ordering.  
>
> That's what the byte order mark is for.

But it doesn't solve the problem. 'cmp' still says the files are
different.  UTF-8 had a similar problem (overlong encodings) but that
has been fixed, UTF-16 and UTF-32 can't be.
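
Both halves of that statement can be illustrated with a short Python
sketch (the sample text is arbitrary; the codecs are the ones shipped
with Python). The two UTF-16 byte orders carry the same code points
but differ byte for byte, and a strict UTF-8 decoder refuses an
overlong sequence:

```python
text = "Gr\u00fc\u00dfe"

big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

# Same code points after decoding, different bytes on disk.
same_text = big.decode("utf-16-be") == little.decode("utf-16-le")
same_bytes = big == little

# C0 AF is an overlong two-byte form of "/" (U+002F); it is invalid UTF-8.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

print(same_text, same_bytes, overlong_accepted)
```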

>> And concatenating two UTF-16 strings from different sources requires
>> knowledge about the encoding. And surrogate pairs complicate matters
>> as well.
>
> Why do you think that surrogate pairs complicate matters? There can't
> be any confusion whether an arbitrary 16-bit value is part of a surrogate
> pair or not; and if it is, whether it is the high surrogate or the
> low one.

One way to see this is to compare UTF-16 with either UTF-8 or
UTF-32.  The surrogate pair construction makes UTF-16 inherit the
disadvantages of both UTF-8 and UTF-32, but none of their advantages.

The disadvantage with UTF-8 is that you don't know where a code value
ends within the encoded data without knowledge of UTF-8, and the
disadvantage with UTF-32 is that it wastes space since most data fit
in 16 bits or less.

If normal computers were 16-bit, I could understand the trade-off, but
with 32-bit (or more) machines you can remove one of the disadvantages
by choosing either UTF-8 or UTF-32 instead of UTF-16.

> As for concatenating I'd say this depends on whether the tools are
> able to deal with it.

Right, and many tools assume that if you receive two binary blobs A
and B which are said to contain text, you can form the concatenation
of the text by concatenating the binary blobs as A||B.  This is a
reasonable assumption, and it works for most encoding schemes,
including UTF-8.  It doesn't work for UTF-16 or UTF-32.
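
A minimal Python sketch of the A||B problem, assuming the two blobs
come from sources that happen to use different byte orders (the
strings and variable names are made up for illustration):

```python
import codecs

# Two UTF-16 blobs, each with its own byte order mark.
a = codecs.BOM_UTF16_LE + "ab".encode("utf-16-le")
b = codecs.BOM_UTF16_BE + "cd".encode("utf-16-be")

# Naive concatenation: only the first BOM is honoured, so the second
# blob (and its BOM) is read with the wrong byte order.
naive = (a + b).decode("utf-16")

# Decoding each blob with knowledge of its encoding works.
proper = a.decode("utf-16") + b.decode("utf-16")

# UTF-8 has no byte order, so plain concatenation is safe.
utf8_joined = ("ab".encode("utf-8") + "cd".encode("utf-8")).decode("utf-8")

print(repr(naive), repr(proper), repr(utf8_joined))
```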

My preference is to use UTF-8 when data is stored or transferred, and
only use UTF-32 internally because applications may need to compare
data against Unicode code points.  If I must use Unicode at all, that
is.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 15:36               ` Oliver Scholz
  2003-08-16 17:14                 ` Reiner Steib
  2003-08-16 17:23                 ` Simon Josefsson
@ 2003-08-17  0:57                 ` Jesper Harder
  2003-08-17 17:24                   ` Oliver Scholz
  2 siblings, 1 reply; 37+ messages in thread
From: Jesper Harder @ 2003-08-17  0:57 UTC (permalink / raw)

Oliver Scholz <alkibiades@gmx.de> writes:

> Jesper Harder <harder@myrealbox.com> writes:
>
>> But the purpose of email is to _communicate_.  Why lower your chance
>> of communicating if there is no compelling technical reason to do
>> so?
>
> Now and then I notice on German Usenet that a few people who post
> replies to my articles cannot deal with UTF-8, because when they
> quote the text I wrote, I see funny characters instead of umlauts.
> This is not a big impediment to communication.

It is a big impediment, believe me.  A long time ago I used to read
Usenet by TELNETting from a Norsk Data terminal to an overloaded
Ultrix box.  Needless to say this setup could not display any 8bit
characters (the eighth bit was stripped).  Reading Danish was so
annoying that I didn't use dk.* for many years.
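
Both kinds of damage mentioned here can be reproduced in a few lines
of Python. This is only a sketch: the helper name and the sample
phrases are invented, and real terminals may have behaved differently
in detail.

```python
def strip_eighth_bit(data):
    """Clear the high bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in data)


# Danish text sent as Latin-1 through a 7-bit path.
original = "r\u00f8dgr\u00f8d med fl\u00f8de"
seen = strip_eighth_bit(original.encode("iso-8859-1")).decode("ascii")

# German text sent as UTF-8 but displayed as if it were Latin-1.
garbled = "f\u00fcr".encode("utf-8").decode("iso-8859-1")

print(seen)
print(garbled)
```

In the first case every "ø" silently turns into "x"; in the second
the single "ü" shows up as two unrelated Latin-1 characters.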

Also remember that not everyone can say "Okay, I'll just upgrade to
something Unicode-capable".  If you're using a shared system you
probably don't have the power to decide that.

> If we don't move to Unicode in the future then coding system
> problems will go on forever and ever.

It would be foolish not to use Unicode for any _new_ protocols or
formats.  But for legacy systems like email and Usenet backward
compatibility is really, really important.  If you look at how
e.g. MIME or format=flowed was designed, you'll see that a lot of
effort and thought was spent on minimizing negative effects for
existing clients.

You need an especially good excuse to break existing stuff.  The fact
that Unicode is a technically more pleasing solution just isn't a good
enough reason to break things unnecessarily, IMHO.

But if you're doing something that wasn't possible before, say, using
German and Thai in the same message, that's a valid reason to use
Unicode.
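
To illustrate the German-and-Thai case, here is a Python sketch that
checks which charsets from a candidate list can encode a given text.
The function name, the candidate list and the sample strings are
assumptions made for the example; this is not how Gnus itself picks a
charset.

```python
def charsets_that_fit(text, candidates):
    """Return the candidate charsets that can encode all of TEXT."""
    fitting = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        fitting.append(name)
    return fitting


candidates = ["us-ascii", "iso-8859-1", "iso-8859-15", "tis-620", "utf-8"]
german = "Gr\u00fc\u00dfe"
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
mixed = german + " / " + thai

print(charsets_that_fit(german, candidates))
print(charsets_that_fit(thai, candidates))
print(charsets_that_fit(mixed, candidates))
```

Each language alone fits a legacy charset; only the mixed message
leaves UTF-8 as the sole candidate from this list.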

> My guess -- by the way -- is that Unicode will become increasingly
> important in Europe, especially for the members of the EU. We'd need
> at least Latin-1/Latin-9, Latin-2 and Greek (ISO 8859-7). And I am not
> sure if that already covers Latvian, Romanian and others. There will
> be a growing need for an encoding that covers all of these languages.

I think most Western European users don't care about and don't know
how to access any glyph that isn't printed on the keyboard.

>> 2. Unicode support itself doesn't really buy me a lot if most people
>>    don't have fairly complete Unicode fonts (which they don't).
>
> So the worst thing that could happen is that they see a hollow box now
> and then.

An empty box can be bad enough.  If you're writing an equation it can
be really important what that empty box happens to be ☺ I experienced
that problem recently when I used ℏ in a message.

> And yet some characters are more frequent than others. You can
> probably rely on the fact that western Europeans have fonts that
> contain the Latin-1 repertoire. Box drawing characters or symbols
> may not be that frequent, but there is a good chance to get the
> additional punctuation characters.

In practice the only thing you can reasonably expect are the 650
glyphs in WGL4.¹

¹ http://partners.adobe.com/asn/tech/type/opentype/appendices/wgl4.jsp

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 22:24                     ` Simon Josefsson
@ 2003-08-17 12:30                       ` Benjamin Riefenstahl
  2003-08-17 16:40                         ` Oliver Scholz
  2003-08-18  2:16                       ` James H. Cloos Jr.
  1 sibling, 1 reply; 37+ messages in thread
From: Benjamin Riefenstahl @ 2003-08-17 12:30 UTC (permalink / raw)


Hi Simon,


Just two additional thoughts, I agree with most of what you said
otherwise.

Simon Josefsson <jas@extundo.com> writes:
> But it doesn't solve the problem. 'cmp' still says the files are
> different.  UTF-8 had a similar problem (overlong encodings) but
> that has been fixed, UTF-16 and UTF-32 can't be.

Actually UTF-8 still has that problem with composed vs. decomposed
characters.  There is no perfect system AFAIK.

> If normal computers were 16-bit, I could understand the trade-off,

Depends on what you call "normal computers."  MS Windows and Apple's
Mac OS X both use UTF-16 for APIs and internal implementation.


benny




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-17 12:30                       ` Benjamin Riefenstahl
@ 2003-08-17 16:40                         ` Oliver Scholz
  2003-08-18  2:20                           ` James H. Cloos Jr.
  2003-08-18 15:58                           ` Benjamin Riefenstahl
  0 siblings, 2 replies; 37+ messages in thread
From: Oliver Scholz @ 2003-08-17 16:40 UTC (permalink / raw)

Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:
[...]
> Simon Josefsson <jas@extundo.com> writes:
>> But it doesn't solve the problem. 'cmp' still says the files are
>> different.  UTF-8 had a similar problem (overlong encodings) but
>> that has been fixed, UTF-16 and UTF-32 can't be.
>
> Actually UTF-8 still has that problem with composed vs. decomposed
> characters.  There is no perfect system AFAIK.

Just to be sure that I understand you correctly: Do you refer to the
fact here that a character like, say, U+00E9 (LATIN SMALL LETTER E
WITH ACUTE) is equivalent to U+0065 followed by U+0301 (LATIN SMALL
LETTER E followed by COMBINING ACUTE ACCENT)?

>> If normal computers were 16-bit, I could understand the trade-off,
>
> Depends on what you call "normal computers."  MS Windows and Apple's
> Mac OS X both use UTF-16 for APIs and internal implementation.
[...]

I am not sure, but I think that the characters that need to be
accessed via surrogate pairs are meant to be rare, since they are
outside of the BMP. So AFAIK UTF-16 is meant as a space-efficient
format for East Asian text. But as I said: this is outside the scope
of things I normally have to deal with.
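
The space argument can be checked with a small Python sketch. The
sample strings are arbitrary and the helper name is hypothetical; the
byte counts follow directly from the encoding rules.

```python
def encoded_sizes(text):
    """Byte length of TEXT in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


german = "Gr\u00fc\u00dfe"            # 5 characters, mostly ASCII
japanese = "\u65e5\u672c\u8a9e"       # 3 BMP characters

german_sizes = encoded_sizes(german)
japanese_sizes = encoded_sizes(japanese)
print(german_sizes)
print(japanese_sizes)
```

For the German sample UTF-8 is the smallest (7 bytes against 10 and
20); for the Japanese sample UTF-16 is the smallest (6 bytes against
9 and 12).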

    Oliver
-- 
30 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-17  0:57                 ` Jesper Harder
@ 2003-08-17 17:24                   ` Oliver Scholz
  2003-08-17 18:21                     ` Matthias Andree
  0 siblings, 1 reply; 37+ messages in thread
From: Oliver Scholz @ 2003-08-17 17:24 UTC (permalink / raw)

Jesper Harder <harder@myrealbox.com> writes:
[...]
> It would be foolish not to use Unicode for any _new_ protocols or
> formats.  But for legacy systems like email and Usenet backward
> compatibility is really, really important.  If you look at how
> e.g. MIME or format=flowed was designed, you'll see that a lot of
> effort and thought was spent on minimizing negative effects for
> existing clients.
>
> You need an especially good excuse to break existing stuff.  The fact
> that Unicode is a technically more pleasing solution just isn't a good
> enough reason to break things unnecessarily, IMHO.
[...]

I have to admit that this is a very strong argument. It could
probably convince me, if the situation in Usenet were not already
such a mess. I agree that it is sometimes a good thing to preserve a
current working state in order to maximize compatibility. But
sometimes it is a good thing to dare a reform. Which of the two applies
to Usenet is probably a matter of judgment. I think I have stated most
of my arguments.

At least I should no longer smile at people who use plain ASCII in
the de.* hierarchy. One could probably rather convince me to use ASCII
than to use, say, Latin-9.

> > My guess -- by the way -- is that Unicode will become increasingly
> > important in Europe, especially for the members of the EU. We'd need
> > at least Latin-1/Latin-9, Latin-2 and Greek (ISO 8859-7). And I am not
> > sure if that already covers Latvian, Romanian and others. There will
> > be a growing need for an encoding that covers all of these languages.

> I think most Western European users don't care about and don't know
> how to access any glyph that isn't printed on the keyboard.

My guess is that the usage of UTF-8 in Europe will start in business
e-mail and spread from there. But maybe this is not my actual
point. It's rather that I want it to be easy to mix different
languages freely. Why shouldn't a Pole or a Chinese posting a German
message to the de.* hierarchy sign with his or her Polish or Chinese
name? Why not a Greek verse in the signature? Or an Arabic proverb?
Many people wouldn't do it unless they can be sure that it wouldn't
garble their umlauts for some people, however small or great their
number may be. This is decent, but I also find it suboptimal. It won't
change until UTF-8 becomes the default.

    Oliver
-- 
30 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16  7:43     ` Ivan Boldyrev
@ 2003-08-17 17:27       ` Oliver Scholz
  0 siblings, 0 replies; 37+ messages in thread
From: Oliver Scholz @ 2003-08-17 17:27 UTC (permalink / raw)


Ivan Boldyrev <boldyrev+nospam@cgitftp.uiggm.nsc.ru> writes:

> On 8472 day of my life Oliver Scholz wrote:
>>> Perhaps some day we can try ASCII first, then fall back to UTF-8.  But
>>> that will take a long time.  Even moving to ISO-8859-1 in northern
>>> Europe took a long time, and still isn't finished.  I still use IBMPC2
>>> (CP437?) in some regional communication channels.
>> [...]
>>
>> I think it is in general a good idea to choose the encoding according
>> to the audience. Fortunately this is not hard with Gnus. There are
>> some people to which I send my mail in Latin-1.
>
> Do you use special group for them or do something more tricky?

I started to use the BBDB recently and I added a special property for
people that should receive mail in an encoding other than UTF-8, like:

egoge-encoding: latin-1

Before that I kept their e-mail addresses in a list; the mechanics
are similar.

But I am not sure whether a defadvice around `message-send-and-exit'
is the best way to do this.

(defadvice message-send-and-exit (around
				  egoge-latin-1-friendly
				  activate)
  "Query the BBDB for a preferred encoding for this message."
  (let* ((address (message-fetch-field "to"))
	 (encoding (and address
			(egoge-bbdb-get-prop
			 (cadr (gnus-extract-address-components
				address))
			 'egoge-encoding))))
    (if (not encoding)
	ad-do-it
      (let ((mm-coding-system-priorities
	     (cons (intern encoding) mm-coding-system-priorities)))
	ad-do-it))))

(defun egoge-bbdb-get-prop (address property)
  "Return PROPERTY of the first BBDB record matching ADDRESS, or nil."
  (let ((record (car (egoge-bbdb-find-address address))))
    (and record
	 (bbdb-record-getprop record property))))

(defun egoge-bbdb-find-address (address)
  "Return BBDB records which contain ADDRESS as net-address.
Return nil if there is no such record."
  (bbdb-search (bbdb-records t) nil nil address))


    Oliver
-- 
30 Thermidor an 211 de la Révolution
Liberté, Egalité, Fraternité!




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-17 17:24                   ` Oliver Scholz
@ 2003-08-17 18:21                     ` Matthias Andree
  0 siblings, 0 replies; 37+ messages in thread
From: Matthias Andree @ 2003-08-17 18:21 UTC (permalink / raw)

Oliver Scholz <alkibiades@gmx.de> writes:

> I have to admit that this is a very strong argument. It could
> probably convince me, if the situation in Usenet were not already
> such a mess. I agree that it is sometimes a good thing to preserve a
> current working state in order to maximize compatibility. But
> sometimes it is a good thing to dare a reform. Which of the two applies
> to Usenet is probably a matter of judgment. I think I have stated most
> of my arguments.

RFC violations in Usenet are commonplace in Northern Europe and Germany
anyways. dk.* and no.* users complained when leafnode didn't accept
their unencoded 8-bit headers. (At least, there's no newsgroup such as
no.østfold - which Arnt Gulbrandsen, original author of leafnode, was
concerned about.)

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 17:23                 ` Simon Josefsson
  2003-08-16 19:18                   ` Oliver Scholz
@ 2003-08-18  2:09                   ` James H. Cloos Jr.
  2003-08-28 13:38                     ` Jens Müller
  2003-08-28 13:35                   ` Jens Müller
  2 siblings, 1 reply; 37+ messages in thread
From: James H. Cloos Jr. @ 2003-08-18  2:09 UTC (permalink / raw)


>>>>> "Simon" == Simon Josefsson <jas@extundo.com> writes:

Simon> Wasn't the Klingon proposal for Unicode rejected?  Tengwar has
Simon> been a proposal for ten years, or so, and nothing has happened,
Simon> as far as I know.

Klingon was rejected because it is a made-up script/language.
Tengwar is unlikely to be accepted for the same reason.

That said, Tengwar has been singled out by some as a great script to
use as the basis for a document describing how to properly support
complex scripts.  (It is probably about as complex to render as
Arabic, Urdu, etc.  At least based on recent comments on the
Unicode list.)

-JimC




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 22:24                     ` Simon Josefsson
  2003-08-17 12:30                       ` Benjamin Riefenstahl
@ 2003-08-18  2:16                       ` James H. Cloos Jr.
  1 sibling, 0 replies; 37+ messages in thread
From: James H. Cloos Jr. @ 2003-08-18  2:16 UTC (permalink / raw)


>>>>> "Simon" == Simon Josefsson <jas@extundo.com> writes:

Simon> The disadvantage with UTF-8 is that you don't know where a code
Simon> value ends within the encoded data without knowledge of UTF-8,

[ed's note: this should be taken as an extension of Simon's point,
            not a counter-argument.  It seemed ambiguous w/o
            a disclaimer....  -JimC]

That isn't really a disadvantage, since you need knowledge of unicode
itself anyway:  not every unit fits in a single code point.  Combining
characters, variation selectors, et al all mean that even with utf32
there is no guarantee that you can split at any given int32, hence
the fact that utf8 cannot be split at any given int8 is irrelevant.
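
A Python sketch of both observations, with a hypothetical helper and
an arbitrary sample string: code point boundaries in UTF-8 are easy
to find from the byte values, yet even a split on a clean UTF-32 unit
boundary can tear a combining sequence apart.

```python
def utf8_boundaries(data):
    """Offsets at which a code point starts in UTF-8 encoded DATA.

    Continuation bytes have the bit pattern 10xxxxxx; every other
    byte begins a new code point.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


text = "e\u0301a"   # 'e' + COMBINING ACUTE ACCENT, then 'a'

utf8 = text.encode("utf-8")
starts = utf8_boundaries(utf8)

# Splitting UTF-32 after the first 4-byte unit is "valid", but it
# separates the base letter from its accent.
utf32 = text.encode("utf-32-be")
head = utf32[:4].decode("utf-32-be")
tail = utf32[4:].decode("utf-32-be")

print(starts, repr(head), repr(tail))
```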

-JimC




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-17 16:40                         ` Oliver Scholz
@ 2003-08-18  2:20                           ` James H. Cloos Jr.
  2003-08-18 15:58                           ` Benjamin Riefenstahl
  1 sibling, 0 replies; 37+ messages in thread
From: James H. Cloos Jr. @ 2003-08-18  2:20 UTC (permalink / raw)


>>>>> "os" == Oliver Scholz <alkibiades@gmx.de> writes:

os> So AFAIK UTF-16 is meant as a space-efficient
os> format for East Asian text.

Actually utf16 is meant to be backwards compatible with the earlier
adopters of unicode -- back when it was a 16 bit standard -- who
were using ucs2.  That it also tends to use fewer bits for most of
the CJK characters is a later issue, AIUI.
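
The compatibility point can be sketched in Python. Python has no
UCS-2 codec, so the `ucs2_be` helper below is a hand-written stand-in
that packs each code point into one 16-bit unit; the sample strings
are arbitrary.

```python
import struct


def ucs2_be(text):
    """Encode TEXT as big-endian UCS-2: one 16-bit unit per code point."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent code points above U+FFFF")
    return b"".join(struct.pack(">H", ord(ch)) for ch in text)


bmp_text = "\u65e5\u672c\u8a9e abc"

# For text inside the BMP the two encodings are byte-identical.
identical = ucs2_be(bmp_text) == bmp_text.encode("utf-16-be")

# Outside the BMP, UTF-16 falls back to a surrogate pair (4 bytes),
# which plain UCS-2 has no way to express.
clef_utf16_len = len("\U0001D11E".encode("utf-16-be"))
try:
    ucs2_be("\U0001D11E")
    ucs2_handles_clef = True
except ValueError:
    ucs2_handles_clef = False

print(identical, clef_utf16_len, ucs2_handles_clef)
```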

-JimC





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-15 17:00   ` Oliver Scholz
  2003-08-16  7:43     ` Ivan Boldyrev
@ 2003-08-18  6:01     ` Steinar Bang
  1 sibling, 0 replies; 37+ messages in thread
From: Steinar Bang @ 2003-08-18  6:01 UTC (permalink / raw)


>>>>> Oliver Scholz <alkibiades@gmx.de>:

[snip!]
> But I do think that *some* people should start to use UTF-8 as a
> default.

Power to you then.

But I suspect you will get many similar responses to those I got ten
years ago, when I started using quoted-unreadable in email (hey! it
was the standard...! :-) )




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-17 16:40                         ` Oliver Scholz
  2003-08-18  2:20                           ` James H. Cloos Jr.
@ 2003-08-18 15:58                           ` Benjamin Riefenstahl
  1 sibling, 0 replies; 37+ messages in thread
From: Benjamin Riefenstahl @ 2003-08-18 15:58 UTC (permalink / raw)


Hi Oliver,


>> Simon Josefsson <jas@extundo.com> writes:
>>> 'cmp' still says the files are different.

> Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:
>> Actually UTF-8 still has that problem with composed vs. decomposed
>> characters.  There is no perfect system AFAIK.

Oliver Scholz <alkibiades@gmx.de> writes:
> Do you refer to the fact here that a character like, say, U+00E9
> (LATIN SMALL LETTER E WITH ACUTE) is equivalent to U+0065 followed
> by U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT)?

Yes.
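
For illustration, the same thing in Python using the standard
`unicodedata` module (variable names are made up): the two spellings
differ as code point sequences and as UTF-8 bytes, and only
normalization makes them compare equal.

```python
import unicodedata

composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT

equal_as_strings = composed == decomposed
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# After normalizing both to NFC (or both to NFD) they are identical.
equal_after_nfc = (unicodedata.normalize("NFC", composed)
                   == unicodedata.normalize("NFC", decomposed))
equal_after_nfd = (unicodedata.normalize("NFD", composed)
                   == unicodedata.normalize("NFD", decomposed))

print(equal_as_strings, composed_bytes, decomposed_bytes,
      equal_after_nfc, equal_after_nfd)
```

So a byte-wise tool such as 'cmp' reports a difference for UTF-8 files
too, unless both were normalized to the same form beforehand.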

> So AFAIK UTF-16 is meant as a space-efficient format for East Asian
> text.

That and compatibility.  The first Unicode versions talked much about
the 16-bit representation and the most wide-spread users (Windows NT,
COM, VFAT, HFS+) implemented it like that.


benny




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 17:14                 ` Reiner Steib
  2003-08-16 19:29                   ` Oliver Scholz
@ 2003-08-19 14:54                   ` Miles Bader
  2003-08-20 15:24                     ` Reiner Steib
  1 sibling, 1 reply; 37+ messages in thread
From: Miles Bader @ 2003-08-19 14:54 UTC (permalink / raw)


Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:
> > The Unicode support for Emacs is quite good; there may be issues
> > with CJK in the current released version of Emacs, but the rest
> > works fine.
> 
> Not only in the released versions, see this thread on emacs-devel:
> <URL:http://article.gmane.org/gmane.emacs.devel/13487>.

Did you try turning on `utf-translate-cjk-mode' (in CVS emacs)?

It enables UTF-8 CJK support.

-Miles
-- 
We live, as we dream -- alone....




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-19 14:54                   ` Miles Bader
@ 2003-08-20 15:24                     ` Reiner Steib
  2003-08-21  0:20                       ` Miles Bader
  0 siblings, 1 reply; 37+ messages in thread
From: Reiner Steib @ 2003-08-20 15:24 UTC (permalink / raw)


On Tue, Aug 19 2003, Miles Bader wrote:

> Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:
>> Not only in the released versions, see this thread on emacs-devel:
>> <URL:http://article.gmane.org/gmane.emacs.devel/13487>.
>
> Did you try turning on `utf-translate-cjk-mode' (in CVS emacs)?

No, since I don't need CJK myself (and usually use Emacs 21.3).

> It enables UTF-8 CJK support.

But UTF support in CVS (HEAD) is not complete yet (as described in the
above-mentioned thread), is it?

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo--- PGP key available via WWW   http://rsteib.home.pages.de/




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-20 15:24                     ` Reiner Steib
@ 2003-08-21  0:20                       ` Miles Bader
  0 siblings, 0 replies; 37+ messages in thread
From: Miles Bader @ 2003-08-21  0:20 UTC (permalink / raw)

Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:
> > It enables UTF-8 CJK support.
> 
> But UTF support in CVS (HEAD) is not complete yet (as described in the
> above-mentioned thread), is it?

I don't know what you mean by `complete'*, but as far as I know the
above-mentioned CJK support was the main big omission.  It's not turned
on by default because it loads some big Lisp files to do the mappings.

* I suppose there will always be small differences, until the real emacs
  unicode branch becomes official (which should happen reasonably soon, I think).

-miles
-- 
We have met the enemy... and he is us.  -- Pogo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-16 17:23                 ` Simon Josefsson
  2003-08-16 19:18                   ` Oliver Scholz
  2003-08-18  2:09                   ` James H. Cloos Jr.
@ 2003-08-28 13:35                   ` Jens Müller
  2 siblings, 0 replies; 37+ messages in thread
From: Jens Müller @ 2003-08-28 13:35 UTC (permalink / raw)


Simon Josefsson <jas@extundo.com> writes:

> Wasn't the Klingon proposal for Unicode rejected?

Yepp. Not suitable for encoding.

The current Klingon characters are just other presentation forms for
letters from the Latin alphabet.




* Re: Gnus: UTF-8 and compatibility with other MUAs
  2003-08-18  2:09                   ` James H. Cloos Jr.
@ 2003-08-28 13:38                     ` Jens Müller
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Müller @ 2003-08-28 13:38 UTC (permalink / raw)


"James H. Cloos Jr." <cloos@jhcloos.com> writes:

> Klingon was rejected because it is a made-up script/language.
> Tengwar is unlikely to be accepted for the same reason.

Then why does the roadmap talk about scripts for artificial
languages?

No, that was probably not the reason.




end of thread, other threads:[~2003-08-28 13:38 UTC | newest]

Thread overview: 37+ messages
2003-08-14 15:48 Gnus: UTF-8 and compatibility with other MUAs Xavier Maillard
2003-08-14 22:39 ` Frank Schmitt
2003-08-15 18:22   ` Xavier Maillard
2003-08-14 23:01 ` Jesper Harder
2003-08-15 13:50   ` Oliver Scholz
2003-08-15 16:48     ` Jesper Harder
2003-08-15 18:10       ` Oliver Scholz
2003-08-16  0:23         ` Jesper Harder
2003-08-16  9:48           ` Oliver Scholz
2003-08-16 13:01             ` Jesper Harder
2003-08-16 15:36               ` Oliver Scholz
2003-08-16 17:14                 ` Reiner Steib
2003-08-16 19:29                   ` Oliver Scholz
2003-08-19 14:54                   ` Miles Bader
2003-08-20 15:24                     ` Reiner Steib
2003-08-21  0:20                       ` Miles Bader
2003-08-16 17:23                 ` Simon Josefsson
2003-08-16 19:18                   ` Oliver Scholz
2003-08-16 22:24                     ` Simon Josefsson
2003-08-17 12:30                       ` Benjamin Riefenstahl
2003-08-17 16:40                         ` Oliver Scholz
2003-08-18  2:20                           ` James H. Cloos Jr.
2003-08-18 15:58                           ` Benjamin Riefenstahl
2003-08-18  2:16                       ` James H. Cloos Jr.
2003-08-18  2:09                   ` James H. Cloos Jr.
2003-08-28 13:38                     ` Jens Müller
2003-08-28 13:35                   ` Jens Müller
2003-08-17  0:57                 ` Jesper Harder
2003-08-17 17:24                   ` Oliver Scholz
2003-08-17 18:21                     ` Matthias Andree
2003-08-15 18:24   ` Xavier Maillard
2003-08-16  0:35     ` Jesper Harder
2003-08-14 23:05 ` Simon Josefsson
2003-08-15 17:00   ` Oliver Scholz
2003-08-16  7:43     ` Ivan Boldyrev
2003-08-17 17:27       ` Oliver Scholz
2003-08-18  6:01     ` Steinar Bang
