From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id XAA04514 for caml-red; Mon, 22 Jan 2001 23:05:24 +0100 (MET) Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id VAA08732 for ; Mon, 22 Jan 2001 21:28:37 +0100 (MET) Received: from localhost.localdomain (jimbo53.zip.com.au [202.7.88.53]) by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id f0MKSXb06425 for ; Mon, 22 Jan 2001 21:28:34 +0100 (MET) Received: from ozemail.com.au (IDENT:root@localhost [127.0.0.1]) by localhost.localdomain (8.9.3/8.8.7) with ESMTP id HAA31944; Tue, 23 Jan 2001 07:27:47 +1100 Message-ID: <3A6C97C3.C109DC15@ozemail.com.au> Date: Tue, 23 Jan 2001 07:27:47 +1100 From: John Max Skaller X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.12-20 i686) X-Accept-Language: en MIME-Version: 1.0 To: Pierpaolo BERNARDI CC: OCAML Subject: Re: Unicode (was RE: JIT-compilation for OCaml?) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: weis@pauillac.inria.fr Pierpaolo BERNARDI wrote: > Let me repeat: ISO has formally agreed to not use code points outside of > the Unicode possibility. OK, accepted. > This leaves room for about 2^20 characters. > > Indeed, some code points from the BMP are reserved > > so Unicode can use multi-word encodings of the lower 4 planes. > > Unicode can be encoded in several ways, for example, UTF-8, UTF-16, > UTF-32, UCS2, etc.. This has nothing to do with the number of characters > that can be encoded. This is not quite right. Unicode is 16 bit, it supports only 2^16 code points: again, unless this has changed recently. However, some of the code points are reserved for UCS-16 encoding of a larger space of 2^20 code points (another four bits -- I was wrong, this is the lower 16 (not 4) planes). So it is not quite true that it has 'nothing to do with the number of characters that can be encoded', since some of the code points of the BMP are reserved precisely for the purpose of two word encodings of a larger space. (I think these are the High and Low Surrogates: U+d8xx, U+dcxx respectively). Note that this is only loosely connected with the encoding of _characters_, since some code points are not characters (such as 'newline'), and some sequences of code points represent a single (accented) character :-) -- John (Max) Skaller, mailto:skaller@maxtal.com.au 10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850 checkout Vyper http://Vyper.sourceforge.net download Interscript http://Interscript.sourceforge.net