From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id XAA09931 for caml-red; Mon, 22 Jan 2001 23:05:51 +0100 (MET) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id WAA08362 for ; Mon, 22 Jan 2001 22:46:01 +0100 (MET) Received: from mailserver.cli.di.unipi.it (crudelia.cli.di.unipi.it [131.114.11.37]) by nez-perce.inria.fr (8.11.1/8.10.0) with ESMTP id f0MLk1X15158 for ; Mon, 22 Jan 2001 22:46:01 +0100 (MET) Organization: Centro di Calcolo - Dipartimento di Informatica di Pisa - Italy Received: from fire.cli.di.unipi.it (fire-ext.cli.di.unipi.it [131.114.11.52]) by mailserver.cli.di.unipi.it (8.9.3/8.9.3) with SMTP id WAA28351 for ; Mon, 22 Jan 2001 22:43:18 +0100 Received: (qmail 4641 invoked by uid 7794); 22 Jan 2001 21:45:57 -0000 Received: from carlotta.cli.di.unipi.it(131.114.11.15) via SMTP by crudelia.cli.di.unipi.it, id smtpda04505; Mon Jan 22 22:44:09 2001 Received: from localhost (bernardp@localhost) by carlotta.cli.di.unipi.it (8.8.5/8.6.12) with SMTP id WAA00864; Mon, 22 Jan 2001 22:44:26 +0100 (MET) X-Authentication-Warning: carlotta.cli.di.unipi.it: bernardp owned process doing -bs Date: Mon, 22 Jan 2001 22:44:26 +0100 (MET) From: Pierpaolo BERNARDI To: John Max Skaller cc: OCAML Subject: Re: Unicode (was RE: JIT-compilation for OCaml?) In-Reply-To: <3A6C97C3.C109DC15@ozemail.com.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: weis@pauillac.inria.fr On Tue, 23 Jan 2001, John Max Skaller wrote: > > Unicode can be encoded in several ways, for example, UTF-8, UTF-16, > > UTF-32, UCS2, etc.. This has nothing to do with the number of characters > > that can be encoded. > > This is not quite right. Unicode is 16 bit, it supports > only 2^16 code points: again, unless this has > changed recently. Sorry, no. The idea that "Unicode is 16 bit" is a relic of the prehistory of Unicode, when it was thought that 64K characters would suffice. Unicode associates abstract characters with a numeric index (scalar value). These scalar values, to be stored in a computer must be serialized. Some serialization methods are, for example: UTF-8 (based on 8-bit chunks), UTF-16 (based on 16-bit chunks), UTF-32 (based on 32-bit chunks). With UTF-32 there's a simple correspondence between Unicode scalar values and chunks: the numeric value of the 32-bit chunk is the scalar value; with the other two formats, Unicode characters use a variable number of chunks: 1 to 4 for UTF-8, 1 or 2 for UTF-16. There are other serialization methods, like, for example the archaic UCS-2 (which uses surrogates, and which I think is used by Microsoft), UTF-7, UTF-EBCDIC, UTF-7,5, etc. Unicode scalar values range from 0x0000 to 0x100000. The forthcoming Unicode 3.1 uses 94,140 of these values. And the same is true for ISO. There are some discrepancies which are due only to the different times of publication of the respective standards. > However, some of the code points are reserved > for UCS-16 encoding of a larger space of 2^20 code points (another You mean UCS-2. UCS-2 is just one encoding among many others, it is not identical to Unicode. Yes, there are code points whose only purpose is to be used with this particular encoding. > Note that this is only loosely connected with > the encoding of _characters_, since some code points are > not characters (such as 'newline'), and some sequences > of code points represent a single (accented) character :-) Yes, although the word 'character' has many meanings: Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding. (2) Synonym for abstract character. (See Definition D3 in Section 3.3, Characters and Coded Representations .) (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. (See ideograph (2).) (from the Unicode Glossary) You mean meaning 1. Usually, in computer related discussions meaning 3 is intended. If you are interested in sorting out the details, you can read UTR-17: http://www.unicode.org/unicode/reports/tr17/ Hope this helps. Pierpaolo