From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA17589 for caml-red; Fri, 12 Jan 2001 10:19:33 +0100 (MET)
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id JAA15311 for ; Fri, 12 Jan 2001 09:35:25 +0100 (MET)
Received: from localhost.localdomain (kenny77.zip.com.au [61.8.18.205]) by nez-perce.inria.fr (8.11.1/8.10.0) with ESMTP id f0C8ZM522730 for ; Fri, 12 Jan 2001 09:35:22 +0100 (MET)
Received: from ozemail.com.au (IDENT:root@localhost [127.0.0.1]) by localhost.localdomain (8.9.3/8.8.7) with ESMTP id TAA07162; Fri, 12 Jan 2001 19:33:54 +1100
Message-ID: <3A5EC172.CE9FBA65@ozemail.com.au>
Date: Fri, 12 Jan 2001 19:33:54 +1100
From: John Max Skaller
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.12-20 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Dave Berry
CC: Markus Mottl, OCAML
Subject: Re: Unicode (was RE: JIT-compilation for OCaml?)
References: <3145774E67D8D111BE6E00C0DF418B663AD720@nt.kal.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: weis@pauillac.inria.fr

Dave Berry wrote:
>
> I thought Unicode was a recognised subset of ISO-10646, corresponding to the
> range 0-2^16. Also, don't Windows NT/2000 use Unicode?

Yes and Yes. More precisely, Unicode is often 'ahead' of ISO, adding new characters which make it into new versions of ISO-10646 later.

> My knowledge of C/C++ is probably out of date, but I thought they just used
> the wide character type, without requiring a particular internal
> representation. In what way do ISO C/C++ support ISO-10646?

There are, for example, both 16 and 31 bit escapes. What the compiler does with them is implementation-defined, I think: it can silently truncate to 16 or even 8 bits, but the programmer can still encode any ISO-10646 character. The type 'wchar_t' has implementation-defined size in C++ (like all the other integral types).
This doesn't exclude using 32 bit characters.

> (I realise this isn't directly on-topic, but it may be relevant for future
> extensions to OCaml?)

I think it is. In particular, OCaml supports 8 bit characters, and even allows the high 128 byte values to be used in identifiers (to allow French names :-)

When and if this support is upgraded, OCaml should go to full ISO-10646 support: for identifiers this is easily done by using UTF-8 (and providing a codec to convert Latin-1 for backward compatibility). Supporting 2^31 code points in regular expressions is more difficult. Collation is a nightmare :-)

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net
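
P.S. To illustrate the Latin-1 to UTF-8 conversion mentioned above for identifiers, here is a minimal sketch in C. It is not taken from any OCaml source; the function name latin1_to_utf8 and its interface are hypothetical, chosen only for this example. It relies on the fact that Latin-1 byte values coincide with the first 256 ISO-10646 code points, so bytes below 0x80 are copied unchanged and each byte in the upper half becomes a two-byte UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert n bytes of Latin-1 (ISO-8859-1) text to UTF-8.
   The caller must supply an output buffer of at least 2 * n bytes, since each
   input byte expands to at most two output bytes.
   Returns the number of bytes written to out. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: identical in Latin-1 and UTF-8. */
            out[j++] = c;
        } else {
            /* 0x80..0xFF: two-byte sequence 110xxxxx 10xxxxxx. */
            out[j++] = (unsigned char)(0xC0 | (c >> 6));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return j;
}
```

For example, the Latin-1 byte 0xE9 (e with acute accent) comes out as the two bytes 0xC3 0xA9. A lexer that accepts UTF-8 identifiers could run legacy Latin-1 sources through such a filter first, which is the backward compatibility path suggested above.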