From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/53745
Path: main.gmane.org!not-for-mail
From: Simon Josefsson <jas@extundo.com>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: Gnus: UTF-8 and compatibility with other MUAs
Date: Sun, 17 Aug 2003 00:24:13 +0200
Sender: ding-owner@lists.math.uh.edu
Message-ID: <ilulltt8cmq.fsf@latte.josefsson.org>
References: <plop87brus6y07.fsf@gnu-rox.org>
	<m3oeyrkfnc.fsf@defun.localdomain>
	<ubruruj1i.fsf@ID-87814.user.dfncis.de>
	<m3isoy50kt.fsf@defun.localdomain>
	<uhe4in64v.fsf@ID-87814.user.dfncis.de>
	<m3u18i30xi.fsf@defun.localdomain>
	<u8ypulyqs.fsf@ID-87814.user.dfncis.de>
	<m3ada9vjrl.fsf@defun.localdomain>
	<uoeyp62cy.fsf@ID-87814.user.dfncis.de>
	<ilu65kxa54l.fsf@latte.josefsson.org>
	<uvfsx5s2s.fsf@ID-87814.user.dfncis.de>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1061072727 27708 80.91.224.253 (16 Aug 2003 22:25:27 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sat, 16 Aug 2003 22:25:27 +0000 (UTC)
Cc: ding@gnus.org
Original-X-From: ding-owner+M2287@lists.math.uh.edu Sun Aug 17 00:25:25 2003
Return-path: <ding-owner+M2287@lists.math.uh.edu>
Original-Received: from malifon.math.uh.edu ([129.7.128.13])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 19o9UD-0004Ik-00
	for <ding-account@gmane.org>; Sun, 17 Aug 2003 00:25:25 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.math.uh.edu)
	by malifon.math.uh.edu with smtp (Exim 3.20 #1)
	id 19o9TB-0005L1-00; Sat, 16 Aug 2003 17:24:21 -0500
Original-Received: from sclp3.sclp.com ([64.157.176.121])
	by malifon.math.uh.edu with smtp (Exim 3.20 #1)
	id 19o9T7-0005Kw-00
	for ding@lists.math.uh.edu; Sat, 16 Aug 2003 17:24:17 -0500
Original-Received: (qmail 73516 invoked by alias); 16 Aug 2003 22:24:17 -0000
Original-Received: (qmail 73510 invoked from network); 16 Aug 2003 22:24:17 -0000
Original-Received: from 178.230.13.217.in-addr.dgcsystems.net (HELO yxa.extundo.com) (217.13.230.178)
  by sclp3.sclp.com with SMTP; 16 Aug 2003 22:24:17 -0000
Original-Received: from latte.josefsson.org (yxa.extundo.com [217.13.230.178])
	(authenticated bits=0)
	by yxa.extundo.com (8.12.9/8.12.9) with ESMTP id h7GMODdk032218
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=OK);
	Sun, 17 Aug 2003 00:24:14 +0200
Original-To: Oliver Scholz <alkibiades@gmx.de>
Mail-Copies-To: nobody
X-Payment: hashcash 1.2 0:030816:alkibiades@gmx.de:dcdaae38dc15cb53
X-Hashcash: 0:030816:alkibiades@gmx.de:dcdaae38dc15cb53
X-Payment: hashcash 1.2 0:030816:ding@gnus.org:28bbd94f312bfe62
X-Hashcash: 0:030816:ding@gnus.org:28bbd94f312bfe62
In-Reply-To: <uvfsx5s2s.fsf@ID-87814.user.dfncis.de> (Oliver Scholz's
 message of "Sat, 16 Aug 2003 21:18:51 +0200")
User-Agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3.50 (gnu/linux)
Precedence: bulk
Xref: main.gmane.org gmane.emacs.gnus.general:53745
X-Report-Spam: http://spam.gmane.org/gmane.emacs.gnus.general:53745

Oliver Scholz <alkibiades@gmx.de> writes:

> [...]
>> UTF-16?  It's not even a well define encoding scheme, two files may
>> contain the exact same Unicode code points, but may differ in a binary
>> comparison, due to byte ordering.  
>
> That's what the byte order mark is for.

But it doesn't solve the problem: 'cmp' still says the files are
different.  UTF-8 had a similar problem (overlong encodings), but that
has been fixed; UTF-16 and UTF-32 can't be.
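
To make the 'cmp' point concrete, here is a minimal sketch in Python
(the sample string and all variable names are my own, picked only for
illustration).  It encodes the same code points in both byte orders,
with and without a BOM, and also checks that a strict UTF-8 decoder
rejects an overlong form:

```python
import codecs

text = "Gnus \u00e9"  # arbitrary sample text, an assumption for this sketch

# The same code points serialized in the two UTF-16 byte orders.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# Prefixing a BOM lets a decoder recover the text either way...
be_bom = codecs.BOM_UTF16_BE + be
le_bom = codecs.BOM_UTF16_LE + le
same_text = be_bom.decode("utf-16") == le_bom.decode("utf-16") == text

# ...but a byte-wise comparison (what 'cmp' does) still sees a difference.
same_bytes = be_bom == le_bom

# UTF-8 has one valid byte sequence per code point.  The overlong
# two-byte form of "/" (0xC0 0xAF) is rejected by a strict decoder.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True

print(same_text, same_bytes, overlong_rejected)
```

The two files decode to identical text, yet they are not identical as
byte strings, which is all a tool like 'cmp' can look at.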

>> And concatenating two UTF-16 strings from different sources requires
>> knowledge about the encoding. And surrogate pairs complicate matters
>> as well.
>
> Why do you think that surrogate pairs complicate matters? There can't
> be any confusion whether an arbitrary 16 bit value is part a surrogate
> pair or not; and if it is, whether it is the higher surrogate or the
> lower one.

One way to see it is to compare UTF-16 with either UTF-8 or
UTF-32.  The surrogate pair construction makes UTF-16 carry the
disadvantages of both UTF-8 and UTF-32, but none of their advantages.

The disadvantage of UTF-8 is that you can't tell where a code point
ends within the encoded data without knowledge of UTF-8, and the
disadvantage of UTF-32 is that it wastes space, since most data fits
in 16 bits or less.

If normal computers were 16-bit, I could understand the trade-off, but
with 32-bit (or wider) machines you can remove one of the disadvantages
by choosing either UTF-8 or UTF-32 instead of UTF-16.
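
A rough sketch of the trade-off in Python (the helper names
encoded_sizes, utf16_units and is_surrogate are hypothetical, and the
three sample characters are my own choice).  It shows that UTF-16 is
variable-length like UTF-8, yet never smaller than two bytes per
character:

```python
def encoded_sizes(ch):
    """Return the byte length of ch in (UTF-8, UTF-16, UTF-32), no BOM."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


def utf16_units(s):
    """Split the UTF-16 (big-endian) encoding of s into 16-bit code units."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# ASCII letter, EURO SIGN (inside the BMP), MUSICAL SYMBOL G CLEF (outside it).
for ch in ("A", "\u20ac", "\U0001d11e"):
    print(hex(ord(ch)), encoded_sizes(ch), [hex(u) for u in utf16_units(ch)])
```

Only UTF-32 gives a fixed width for all three characters; UTF-16 needs
a surrogate pair for the last one, so code that indexes by 16-bit unit
has to know about the encoding just as UTF-8 code does.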

> As for concatenating I'd say this depends on whether the tools are
> able to deal with it.

Right, and many tools assume that if you receive two binary blobs A
and B which are said to contain text, you can form the concatenation
of the text by concatenating the binary blobs as A||B.  This is a
reasonable assumption, and it works for most encoding schemes,
including UTF-8.  It doesn't work for UTF-16 or UTF-32.
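
The A||B case can be sketched like this in Python (the strings "foo"
and "bar" and the variable names are assumptions made for the example;
the two UTF-16 blobs are imagined to come from sources using different
byte orders, each with its own BOM):

```python
import codecs

a_text, b_text = "foo", "bar"

# UTF-8: concatenating the blobs concatenates the text.
a8 = a_text.encode("utf-8")
b8 = b_text.encode("utf-8")
utf8_joined = (a8 + b8).decode("utf-8")

# UTF-16 from two sources with different byte orders, each with a BOM.
a16 = codecs.BOM_UTF16_LE + a_text.encode("utf-16-le")
b16 = codecs.BOM_UTF16_BE + b_text.encode("utf-16-be")
mixed_joined = (a16 + b16).decode("utf-16")

# Even with the same byte order, the second BOM survives in the middle
# of the text as U+FEFF.
c16 = codecs.BOM_UTF16_LE + b_text.encode("utf-16-le")
same_order_joined = (a16 + c16).decode("utf-16")

print(repr(utf8_joined))
print(repr(mixed_joined))
print(repr(same_order_joined))
```

Only the UTF-8 result is the plain concatenation "foobar".  To join
the UTF-16 blobs correctly, a tool has to decode each one separately,
which is exactly the knowledge about the encoding I mentioned.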

My preference is to use UTF-8 when data is stored or transferred, and
to use UTF-32 only internally, because applications may need to compare
data against Unicode code points.  If I must use Unicode at all, that
is.
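
As a small illustration of that split, assuming Python, whose str type
already exposes text as a sequence of code points (the sample text and
the names stored, code_points and utf32_units are hypothetical):

```python
import struct

# Data as it would be stored or transferred: UTF-8 bytes.
stored = "na\u00efve \U0001d11e".encode("utf-8")

# Internally, work on code points; ord() gives one integer per character.
code_points = [ord(c) for c in stored.decode("utf-8")]

# The same integers are what a UTF-32 buffer holds, one 32-bit unit each.
utf32 = stored.decode("utf-8").encode("utf-32-be")
utf32_units = list(struct.unpack(">%dI" % (len(utf32) // 4), utf32))

# Comparing against a known code point is then a plain integer test.
has_g_clef = 0x1D11E in code_points

print(code_points == utf32_units, has_g_clef)
```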