From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id MAA09924; Thu, 17 Jan 2002 12:26:15 +0100 (MET) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id MAA10038 for ; Thu, 17 Jan 2002 12:26:14 +0100 (MET) Received: from mbg.sphere.ne.jp (mbg.sphere.ne.jp [203.138.71.44]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g0HBQCf22495; Thu, 17 Jan 2002 12:26:12 +0100 (MET) Received: from localhost (pl093.nas521.k-tokyo.nttpc.ne.jp [210.165.66.93]) by mbg.sphere.ne.jp (8.9.3+3.2W/3.7W) with ESMTP id UAA02075; Thu, 17 Jan 2002 20:25:47 +0900 (JST) In-Reply-To: References: <20020110185619.A20606@pauillac.inria.fr> Subject: RE: [Caml-list] Non-mutable strings To: mattias.waldau@abc.se Cc: caml-list@inria.fr, xavier.leroy@inria.fr X-Mailer: Mew version 1.94.2 on XEmacs 21.1 (Capitol Reef) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20020117185634Y.yoriyuki@ms.u-tokyo.ac.jp> Date: Thu, 17 Jan 2002 18:56:34 +0900 From: YAMAGATA yoriyuki X-Dispatcher: imput version 991025(IM133) Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk From: "Mattias Waldau" Subject: RE: [Caml-list] Non-mutable strings Date: Wed, 16 Jan 2002 20:22:36 +0100 > Thus, introducing Unicode strings (or something similar, I heard that Asians > don't like Unicode at all) and introducing non-mutable strings should > preferrable be done simultaneously. There is criticism to Unicode (Most of them goes to Han-unification, which integrates all regional variants of ideographics to a single set of character), but as far as I know, it is the only international character set in which the standard ways of string matching, comparison and sorting are defined. Pattern matching is important to caml, so I think using Unicode is preferable. > P.s. Microsoft NT, 2000, XP handles double byte chars everywhere, it is > called BSTR and in order to make string comparasion etc library-routines are > called all the time. However, since Unicode can be 4 byte, I don't know how > that is encoded into 2 bytes. Unicode standard requires handling an unicode character as one or two 16bits integers. If a characters is longer than 2 bytes, it is represented as a pair of surrogate points (specially aligned 16 bits integers for this purpose.) Surrogate pairs can only represent 3 bytes character, so Unicode as its narrow sense can only be 3 bytes. I don't know whether Windows supports surrogates, but since MS is one of the founding members of Unicode consortium, they will be supported in the future, any ways. However, Unicode, as customary called, has another standard, ISO-UCS. ISO-UCS allows that characters becomes 31-bits long, and ISO seems to recommend that all characters are represented as 32-bits integers. Clearly, ISO approaches are more simple and allows fast indexing. On the other hand, Unicode is more widely used and provide better algorithm for case mapping, character classification etc. For caml, in my really humble opinion, the language had better to hide such difference (16-bits or 32-bits) and if it can not be hidden (like case mapping), offer choice to users. Regards -- YAMAGATA, yoriyuki (doctoral student) Department of Mathematical Science, University of Tokyo. ------------------- Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr