From: Simon Josefsson
Newsgroups: gmane.emacs.gnus.general
Subject: Re: Gnus: UTF-8 and compatibility with other MUAs
Date: Sun, 17 Aug 2003 00:24:13 +0200
Original-To: Oliver Scholz
Cc: ding@gnus.org
In-Reply-To: (Oliver Scholz's message of "Sat, 16 Aug 2003 21:18:51 +0200")
User-Agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3.50 (gnu/linux)

Oliver Scholz writes:

> [...]
>> UTF-16? It's not even a well define encoding scheme, two files may
>> contain the exact same Unicode code points, but may differ in a binary
>> comparison, due to byte ordering.
>
> That's what the byte order mark is for.

But it doesn't solve the problem. 'cmp' still says the files are
different. UTF-8 had a similar problem (overlong encodings), but that
has been fixed; UTF-16 and UTF-32 can't be.

>> And concatenating two UTF-16 strings from different sources requires
>> knowledge about the encoding. And surrogate pairs complicate matters
>> as well.
>
> Why do you think that surrogate pairs complicate matters? There can't
> be any confusion whether an arbitrary 16 bit value is part a surrogate
> pair or not; and if it is, whether it is the higher surrogate or the
> lower one.

One way to see it is to compare UTF-16 with both UTF-8 and UTF-32.
The surrogate pair construction gives UTF-16 the disadvantages of both
UTF-8 and UTF-32, but none of their advantages. The disadvantage of
UTF-8 is that you don't know where a code value ends within the encoded
data without knowledge of UTF-8, and the disadvantage of UTF-32 is that
it wastes space, since most data fits in 16 bits or less. If normal
computers were 16-bit, I could understand the trade-off, but with 32-bit
(or wider) machines you can remove one of the disadvantages by choosing
either UTF-8 or UTF-32 instead of UTF-16.

> As for concatenating I'd say this depends on whether the tools are
> able to deal with it.

Right, and many tools assume that if you receive two binary blobs A and
B which are said to contain text, you can form the concatenation of the
text by concatenating the binary blobs as A||B. This is a reasonable
assumption, and it works for most encoding schemes, including UTF-8. It
doesn't work for UTF-16 or UTF-32, where the two blobs may differ in
byte order or each carry its own byte order mark.

My preference is to use UTF-8 when data is stored or transferred, and
to use UTF-32 only internally, because applications may need to compare
data against Unicode code points. If I must use Unicode at all, that is.
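
To make the points above concrete, here is a small Python sketch. It is
only an illustration, not anything Gnus does: the helper names
(utf16_blob, naive_concat_decodes, sizes) and the sample strings are
made up for this example, and it assumes each "source" prepends a byte
order mark to its UTF-16 output.

```python
import codecs


def utf16_blob(text, byteorder):
    """Encode text as UTF-16 with a leading byte order mark (hypothetical helper)."""
    if byteorder == "le":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def naive_concat_decodes(blob_a, blob_b, encoding, expected):
    """Return True if decoding the raw concatenation A||B yields the expected text."""
    try:
        return (blob_a + blob_b).decode(encoding) == expected
    except UnicodeDecodeError:
        return False


def sizes(ch):
    """Encoded length in bytes of one character in each scheme (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


a = "Gnus "
b = "\u00e5\u00e4\u00f6 \U0001D11E"  # last character lies outside the BMP

# 1. Same code points, different bytes: this is what 'cmp' complains about.
le = utf16_blob(a, "le")
be = utf16_blob(a, "be")
same_text = le.decode("utf-16") == be.decode("utf-16") == a
same_bytes = le == be

# 2. A||B: fine for UTF-8, not for UTF-16 blobs from different sources.
utf8_ok = naive_concat_decodes(a.encode("utf-8"), b.encode("utf-8"), "utf-8", a + b)
utf16_mixed_ok = naive_concat_decodes(
    utf16_blob(a, "le"), utf16_blob(b, "be"), "utf-16", a + b
)
# Even with matching byte order, the second BOM survives as a stray U+FEFF.
utf16_same_order_ok = naive_concat_decodes(
    utf16_blob(a, "le"), utf16_blob(b, "le"), "utf-16", a + b
)

# 3. Surrogate pairs: UTF-16 is variable width, yet still 2 bytes for ASCII.
ascii_sizes = sizes("a")
clef_sizes = sizes("\U0001D11E")

print("same text:", same_text, "| same bytes:", same_bytes)
print("UTF-8 A||B ok:", utf8_ok)
print("UTF-16 A||B ok (mixed byte order):", utf16_mixed_ok)
print("UTF-16 A||B ok (same byte order, two BOMs):", utf16_same_order_ok)
print("bytes for 'a':", ascii_sizes)
print("bytes for U+1D11E:", clef_sizes)
```

A tool that wants to concatenate the UTF-16 blobs correctly has to
decode each one separately (or at least strip the second BOM and fix
the byte order), which is exactly the "knowledge about the encoding"
mentioned above.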