Gnus development mailing list
From: Simon Josefsson <jas@extundo.com>
Cc: ding@gnus.org
Subject: Re: Gnus: UTF-8 and compatibility with other MUAs
Date: Sun, 17 Aug 2003 00:24:13 +0200
Message-ID: <ilulltt8cmq.fsf@latte.josefsson.org>
In-Reply-To: <uvfsx5s2s.fsf@ID-87814.user.dfncis.de> (Oliver Scholz's message of "Sat, 16 Aug 2003 21:18:51 +0200")

Oliver Scholz <alkibiades@gmx.de> writes:

> [...]
>> UTF-16?  It's not even a well-defined encoding scheme: two files may
>> contain the exact same Unicode code points, but may differ in a binary
>> comparison, due to byte ordering.
>
> That's what the byte order mark is for.

But it doesn't solve the problem: 'cmp' still says the files are
different.  UTF-8 had a similar problem (overlong encodings), but that
has been fixed; UTF-16 and UTF-32 can't be.
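
To illustrate the point, here is a minimal Python sketch (the sample
text and variable names are my own, not from any real file).  It
builds two UTF-16 serializations of the same code points, one per byte
order, and compares them byte for byte, which is what 'cmp' does.  It
also checks that a strict UTF-8 decoder rejects an overlong encoding.

```python
import codecs

text = "Gnus \u00fc"  # the same Unicode code points in both "files"

# Two valid UTF-16 serializations, each starting with its byte order mark.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Both decode to the same text (the "utf-16" codec consumes the BOM)...
same_text = little.decode("utf-16") == big.decode("utf-16") == text

# ...yet a byte-wise comparison sees two different files.
same_bytes = little == big

# 0xC0 0xAF is an overlong two-byte encoding of "/" (U+002F).
# A strict UTF-8 decoder must reject it.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

print(same_text, same_bytes, overlong_accepted)
```

So the BOM lets a decoder recover the text, but it does not give the
text a single canonical byte representation.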

>> And concatenating two UTF-16 strings from different sources requires
>> knowledge about the encoding. And surrogate pairs complicate matters
>> as well.
>
> Why do you think that surrogate pairs complicate matters? There can't
> be any confusion whether an arbitrary 16 bit value is part of a surrogate
> pair or not; and if it is, whether it is the higher surrogate or the
> lower one.

One way to see it is to compare UTF-16 with either UTF-8 or UTF-32.
The surrogate pair construction makes UTF-16 share the disadvantages
of both UTF-8 and UTF-32, but none of their advantages.

The disadvantage of UTF-8 is that you don't know where a code value
ends within the encoded data without knowledge of UTF-8, and the
disadvantage of UTF-32 is that it wastes space, since most data fits
in 16 bits or less.

If normal computers were 16-bit, I could understand the trade-off, but
with 32-bit (or wider) machines you can remove one of the disadvantages
by choosing either UTF-8 or UTF-32 instead of UTF-16.
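
A rough sketch of the size trade-off, in Python.  The three sample
characters are my own choice for illustration: one ASCII letter, one
character elsewhere in the 16-bit range, and one character above
U+FFFF that needs a surrogate pair in UTF-16.

```python
samples = {
    "ascii": "A",                    # U+0041
    "bmp": "\u20ac",                 # U+20AC EURO SIGN
    "supplementary": "\U0001d11e",   # U+1D11E MUSICAL SYMBOL G CLEF
}

# Bytes needed per character in each encoding form (no BOM).
sizes = {
    name: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )
    for name, ch in samples.items()
}

for name, (u8, u16, u32) in sizes.items():
    print(f"{name:14} utf-8={u8} utf-16={u16} utf-32={u32}")

# The character above U+FFFF becomes two 16-bit units: a high
# surrogate (0xD834) followed by a low surrogate (0xDD1E).
pair = samples["supplementary"].encode("utf-16-be")
```

UTF-32 is always 4 bytes, UTF-8 ranges from 1 to 4, and UTF-16 is
either 2 or 4.  So UTF-16 is variable-width like UTF-8, yet it is
twice the size of UTF-8 for ASCII data and byte-order dependent like
UTF-32.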

> As for concatenating I'd say this depends on whether the tools are
> able to deal with it.

Right, and many tools assume that if you receive two binary blobs A
and B which are said to contain text, you can form the concatenation
of the text by concatenating the binary blobs as A||B.  This is a
reasonable assumption, and it works for most encoding schemes,
including UTF-8.  It doesn't work for UTF-16 or UTF-32.
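
A small Python sketch of the A||B case.  The two "sources" are
hypothetical: I assume one produces little-endian UTF-16 and the other
big-endian, each with a byte order mark, which is a plausible but
invented scenario.

```python
import codecs

# UTF-8: concatenating the blobs concatenates the text.
a8 = "foo".encode("utf-8")
b8 = "bar".encode("utf-8")
joined8 = (a8 + b8).decode("utf-8")

# UTF-16 from two hypothetical sources with different byte orders.
a16 = codecs.BOM_UTF16_LE + "foo".encode("utf-16-le")
b16 = codecs.BOM_UTF16_BE + "bar".encode("utf-16-be")

# The decoder picks the byte order from the first BOM only, so the
# second blob's BOM and contents are read with the wrong byte order.
joined16 = (a16 + b16).decode("utf-16")

print(repr(joined8))
print(repr(joined16))
```

Even when both sources agree on the byte order, the second BOM ends up
in the middle of the text as U+FEFF, so the tool still has to know
about the encoding to join the blobs correctly.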

My preference is to use UTF-8 when data is stored or transferred, and
to use UTF-32 only internally, because applications may need to
compare data against Unicode code points.  If I must use Unicode at
all, that is.
