From: Simon Josefsson <jas@extundo.com>
Cc: ding@gnus.org
Subject: Re: Gnus: UTF-8 and compatibility with other MUAs
Date: Sun, 17 Aug 2003 00:24:13 +0200
Message-ID: <ilulltt8cmq.fsf@latte.josefsson.org>
In-Reply-To: <uvfsx5s2s.fsf@ID-87814.user.dfncis.de> (Oliver Scholz's message of "Sat, 16 Aug 2003 21:18:51 +0200")
Oliver Scholz <alkibiades@gmx.de> writes:
> [...]
>> UTF-16? It's not even a well-defined encoding scheme: two files may
>> contain the exact same Unicode code points but differ in a binary
>> comparison, due to byte ordering.
>
> That's what the byte order mark is for.
But it doesn't solve the problem: 'cmp' still says the files are
different. UTF-8 had a similar problem (overlong encodings), but that
has been fixed; in UTF-16 and UTF-32 it can't be.
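
A minimal Python sketch of this point, assuming Python 3's built-in
codecs. The helper names (same_text_different_bytes, is_valid_utf8) are
made up for illustration. It shows two UTF-16 blobs that decode to the
same text but differ bytewise, and a strict UTF-8 decoder rejecting an
overlong form:

```python
# The same code points, serialized as UTF-16 in both byte orders.
text = "h\u00e9llo"

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
le_bom = b"\xff\xfe" + le  # BOM (U+FEFF) in little-endian order
be_bom = b"\xfe\xff" + be  # BOM (U+FEFF) in big-endian order


def same_text_different_bytes(a: bytes, b: bytes) -> bool:
    """True if a and b decode to the same text as UTF-16 but differ bytewise."""
    return a != b and a.decode("utf-16") == b.decode("utf-16")


# The BOM lets the decoder recover the text, but a binary comparison
# (what 'cmp' does) still reports a difference.
differs = same_text_different_bytes(le_bom, be_bom)
print("same text, different bytes:", differs)


def is_valid_utf8(data: bytes) -> bool:
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF is an overlong two-byte form of "/" (U+002F); strict
# decoders reject it, so "/" has exactly one valid UTF-8 encoding.
overlong_slash = b"\xc0\xaf"
print("overlong accepted:", is_valid_utf8(overlong_slash))
print("shortest form accepted:", is_valid_utf8(b"/"))
```
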
>> And concatenating two UTF-16 strings from different sources requires
>> knowledge about the encoding. And surrogate pairs complicate matters
>> as well.
>
> Why do you think that surrogate pairs complicate matters? There can't
> be any confusion whether an arbitrary 16 bit value is part of a surrogate
> pair or not; and if it is, whether it is the higher surrogate or the
> lower one.
One way to see it is to compare UTF-16 with either UTF-8 or
UTF-32. The surrogate pair construction gives UTF-16 the
disadvantages of both UTF-8 and UTF-32, but none of their advantages.
The disadvantage of UTF-8 is that you can't tell where a code point
ends within the encoded data without knowledge of UTF-8 (it is
variable-width), and the disadvantage of UTF-32 is that it wastes
space, since most data fits in 16 bits or less.
If normal computers were 16-bit, I could understand the trade-off, but
with 32-bit (or wider) machines you can remove one of the disadvantages
by choosing either UTF-8 or UTF-32 instead of UTF-16.
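
The trade-off can be made concrete with a short Python sketch, assuming
Python 3's built-in codecs. The sample characters and the helper name
encoded_sizes are chosen only for illustration. It prints the encoded
size of a few code points in each form:

```python
# One code point from each "size class".
samples = {
    "U+0041": "A",            # ASCII
    "U+00E9": "\u00e9",       # Latin-1 range
    "U+20AC": "\u20ac",       # elsewhere in the Basic Multilingual Plane
    "U+1D11E": "\U0001d11e",  # outside the BMP, needs a surrogate pair
}


def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of ch in (UTF-8, UTF-16, UTF-32), without BOM."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}
for name, (u8, u16, u32) in sizes.items():
    print(f"{name}: UTF-8={u8} UTF-16={u16} UTF-32={u32} bytes")

# U+1D11E becomes the surrogate pair D834 DD1E in UTF-16.
pair = samples["U+1D11E"].encode("utf-16-be")
print(pair.hex())
```

UTF-16 is fixed-width only up to U+FFFF. Beyond that it is
variable-width like UTF-8, while still spending two bytes on every
ASCII character.
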
> As for concatenating I'd say this depends on whether the tools are
> able to deal with it.
Right, and many tools assume that if you receive two binary blobs A
and B which are said to contain text, you can form the concatenation
of the text by concatenating the binary blobs as A||B. This is a
reasonable assumption, and it works for most encoding schemes,
including UTF-8. It doesn't work for UTF-16 or UTF-32, where each blob
may carry its own byte order mark and byte order.
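
A small Python sketch of the A||B case, assuming Python 3's built-in
codecs. The blobs are built by hand so that the byte order is explicit,
and the variable names are illustrative only:

```python
a_text, b_text = "foo", "bar"

# UTF-8: concatenating the blobs concatenates the text.
utf8_joined = (a_text.encode("utf-8") + b_text.encode("utf-8")).decode("utf-8")
utf8_ok = utf8_joined == a_text + b_text

# UTF-16 blobs from two sources, each with its own BOM and byte order.
blob_a = b"\xff\xfe" + a_text.encode("utf-16-le")  # little-endian source
blob_b = b"\xfe\xff" + b_text.encode("utf-16-be")  # big-endian source

# Only the first BOM is honoured, so the second blob is read in the
# wrong byte order and turns into unrelated characters.
mixed_order = (blob_a + blob_b).decode("utf-16")

# Even with matching byte order, the second BOM ends up inside the
# text as a stray U+FEFF.
blob_c = b"\xff\xfe" + b_text.encode("utf-16-le")
same_order = (blob_a + blob_c).decode("utf-16")

print(utf8_ok)
print(ascii(mixed_order))
print(ascii(same_order))
```
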
My preference is to use UTF-8 when data is stored or transferred, and
to use UTF-32 only internally, because applications may need to compare
data against Unicode code points. If I must use Unicode at all, that
is.
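
One way this split might look in practice, as a hedged Python sketch:
UTF-8 at the boundary, a list of code points (equivalent to UTF-32 code
units) inside. The function names are hypothetical.

```python
import struct


def utf8_to_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes from storage or the wire into a list of code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf8(code_points) -> bytes:
    """Encode a sequence of code points back to UTF-8 for storage or transfer."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


wire = "na\u00efve \U0001d11e".encode("utf-8")
cps = utf8_to_code_points(wire)

# Internally, each element is one code point, so comparing against a
# specific code point is a plain integer comparison.
has_i_diaeresis = 0x00EF in cps

# The list matches the UTF-32 code units of the same text.
utf32_units = list(
    struct.unpack(f">{len(cps)}I", wire.decode("utf-8").encode("utf-32-be"))
)

round_trip = code_points_to_utf8(cps)
print(has_i_diaeresis, cps == utf32_units, round_trip == wire)
```
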