From: François Pinard
Newsgroups: gmane.emacs.gnus.general
Subject: Re: "Coding system"? Eh?
Date: 11 Sep 1998 10:16:20 -0400
To: davidk@lysator.liu.se (David Kågedal)
Cc: ding@gnus.org

> > Does Unicode now define UTF-7?
> > It originated from the IETF, and UTF-7 is specifically for MIME
> > contexts, which Unicode does not address.
>
> I might be wrong about the origin of UTF-7.  But it's still ugly.

We are all helping each other here; it is not that important whether we are wrong or right, as long as we improve.  If UTF-7 has been adopted by Unicode, I would surely have liked to know, because the `recode' documentation would then need to be adjusted.

About the ugliness of UTF-7, I agree to a certain extent.  For ding readers who do not know, UTF-7 is a kind of quoted-printable for characters using more than 8 bits, suited for transmission over 7-bit channels.  Very roughly, instead of `=' it uses `+', and instead of hexadecimal values it uses in-lined Base64.  I found it a bit painful to write a UTF-7 encoder and decoder, but now that it is done, the algorithmic ugliness (which is another kind of ugliness) is all hidden in black boxes, and we might consider that it is not in the way anymore.  UTF-8 has its elegances, but it is still slightly painful to write _efficient_ encoders/decoders.

For transmission of Unicode or ISO 10646 message bodies, it looks to me as if we have a choice between UTF-8 and UTF-7.  UCS-2 and UCS-4 are internal formats not well suited for transmission, UTF-1 is obsolete, and UTF-16 is not much better than UCS-2 for transmission, at least until all machines replace 8-bit bytes with 16-bit bytes, and this will not happen in this century :-).

In fact, we have to look at things with a cold eye here.  If you do not have a decoder integrated into Gnus or your other mail readers, I am not sure which of UTF-8 or UTF-7 looks uglier.  UTF-8 would look like a mix of ASCII and binary dump, while UTF-7 would look like ASCII with fragments of Base64 in it.  I might prefer UTF-7, after all, maybe.  And if you have a well-integrated decoder, you do not see the ugliness at all: the algorithmic ugliness is hidden once and for all in black boxes anyway, and then it does not really matter.
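To make the comparison above concrete, here is a small sketch using Python's built-in codecs (an illustration added for this archive, not the `recode' implementation the message refers to; the sample string "Hé" is arbitrary).  It shows how the same text looks in UTF-7 versus UTF-8:

```python
# Compare UTF-7 and UTF-8 renderings of the same text, using
# Python's standard codecs.  "Hé" is a hypothetical sample:
# one plain ASCII character plus one accented character.

text = "Hé"

utf7 = text.encode("utf-7")   # non-ASCII run becomes +<Base64 of UTF-16 units>-
utf8 = text.encode("utf-8")   # non-ASCII character becomes raw 8-bit bytes

print(utf7)   # b'H+AOk-'   : pure 7-bit ASCII, safe on 7-bit channels
print(utf8)   # b'H\xc3\xa9': needs an 8-bit clean channel

# Both round-trip back to the original text:
assert utf7.decode("utf-7") == text
assert utf8.decode("utf-8") == text
```

The `+AOk-` fragment is exactly the "in-lined Base64" the message describes: `+` opens a Base64 run (much as `=` opens an escape in quoted-printable), and `-` closes it.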
> The difference between UTF-16 and UCS-2 is that it can encode some of
> the characters outside the Unicode range (BMP).  So I guess Unicode
> has no need for UTF-16.

Unicode needs it, because people are beginning to see that 65,000 characters are not as sufficient as was once thought.  I mean that a few years ago, it was believed that 65,000 characters would satisfy all our needs for many years, but relatively soon people began to see that it is not enough, and that we need a way to get more characters.  ISO 10646 had much higher goals to start with, so it did not have that problem.

UTF-16 extends the Unicode set to around 1,000,000 characters, still much less than ISO 10646, but much more comfortable than 65,000 -- and ISO 10646 later made room in its BMP so that the UTF-16 technique could be implemented more simply.  I do not think ISO 10646 ever needed UTF-16, but it wanted Unicode compatibility.

-- 
François Pinard   mailto:pinard@iro.umontreal.ca
Join the free Translation Project!   http://www.iro.umontreal.ca/~pinard
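[As a note for archive readers: the "room in the BMP" mentioned above is the surrogate range, and the mechanism can be sketched in a few lines of Python.  This is an added illustration, not part of the original exchange; the function name `to_surrogates` is hypothetical.]

```python
# Sketch of the UTF-16 surrogate-pair mechanism: a code point above
# the BMP is split into two 16-bit units drawn from ranges that the
# BMP reserves for this purpose (D800-DBFF high, DC00-DFFF low).

def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    the high/low surrogate pair UTF-16 uses to encode it."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                  # 20 bits remain
    high = 0xD800 + (cp >> 10)     # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)    # bottom 10 bits
    return high, low

# U+1F600 lies outside the BMP, so UTF-16 needs two code units:
print([hex(u) for u in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00']

# Python's own codec produces the same pair, serialized big-endian:
print("\U0001F600".encode("utf-16-be"))
```

Two 10-bit halves give 2**20 = 1,048,576 extra code points beyond the BMP, which is the "around 1,000,000 characters" figure in the message.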