From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id MAA09924; Thu, 17 Jan 2002 12:26:15 +0100 (MET)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id MAA10038 for <caml-list@pauillac.inria.fr>; Thu, 17 Jan 2002 12:26:14 +0100 (MET)
Received: from mbg.sphere.ne.jp (mbg.sphere.ne.jp [203.138.71.44])
	by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g0HBQCf22495;
	Thu, 17 Jan 2002 12:26:12 +0100 (MET)
Received: from localhost (pl093.nas521.k-tokyo.nttpc.ne.jp [210.165.66.93])
	by mbg.sphere.ne.jp (8.9.3+3.2W/3.7W) with ESMTP id UAA02075;
	Thu, 17 Jan 2002 20:25:47 +0900 (JST)
In-Reply-To: <AAEBJHFJOIPMMIILCEPBKEPBDGAA.mattias.waldau@abc.se>
References: <20020110185619.A20606@pauillac.inria.fr>
	<AAEBJHFJOIPMMIILCEPBKEPBDGAA.mattias.waldau@abc.se>
Subject: RE: [Caml-list] Non-mutable strings
To: mattias.waldau@abc.se
Cc: caml-list@inria.fr, xavier.leroy@inria.fr
X-Mailer: Mew version 1.94.2 on XEmacs 21.1 (Capitol Reef)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20020117185634Y.yoriyuki@ms.u-tokyo.ac.jp>
Date: Thu, 17 Jan 2002 18:56:34 +0900
From: YAMAGATA yoriyuki <yoriyuki@ms.u-tokyo.ac.jp>
X-Dispatcher: imput version 991025(IM133)
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

From: "Mattias Waldau" <mattias.waldau@abc.se>
Subject: RE: [Caml-list] Non-mutable strings
Date: Wed, 16 Jan 2002 20:22:36 +0100

> Thus, introducing Unicode strings (or something similar, I heard that Asians
> don't like Unicode at all) and introducing non-mutable strings should
> preferably be done simultaneously.

There is criticism of Unicode (most of it is directed at Han
unification, which merges the regional variants of ideographs into a
single set of characters), but as far as I know, it is the only
international character set for which standard methods of string
matching, comparison and sorting are defined.  Pattern matching is
important to Caml, so I think using Unicode is preferable.

> P.s. Microsoft NT, 2000, XP handles double byte chars everywhere, it is
> called BSTR and in order to do string comparison etc., library routines are
> called all the time. However, since Unicode can be 4 byte, I don't know how
> that is encoded into 2 bytes.

The Unicode standard's 16-bit encoding (UTF-16) handles a Unicode
character as one or two 16-bit integers.  If a character does not fit
in 16 bits, it is represented as a pair of surrogates (16-bit values
from a range reserved for this purpose, 0xD800-0xDFFF).  Surrogate
pairs reach only up to U+10FFFF, which is 21 bits, so a character of
Unicode in its narrow sense always fits in 3 bytes.  I don't know
whether Windows supports surrogates, but since MS is one of the
founding members of the Unicode Consortium, I expect they will be
supported in the future anyway.
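
To make the surrogate arithmetic concrete, here is a minimal sketch in
OCaml.  It works on plain ints, and the function names utf16_encode and
utf16_decode_pair are my own illustration, not part of any existing
library.

```ocaml
(* Sketch of UTF-16 surrogate pair arithmetic on plain ints.
   The function names are illustrative, not from any library. *)

(* Encode a code point as one or two 16-bit code units. *)
let utf16_encode (cp : int) : int list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf16_encode: not a valid character code"
  else if cp < 0x10000 then [ cp ]
  else
    let v = cp - 0x10000 in            (* 20 bits remain *)
    [ 0xD800 lor (v lsr 10);           (* high surrogate: top 10 bits *)
      0xDC00 lor (v land 0x3FF) ]      (* low surrogate: bottom 10 bits *)

(* Combine a high and a low surrogate back into a code point. *)
let utf16_decode_pair (hi : int) (lo : int) : int =
  if hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF then
    invalid_arg "utf16_decode_pair: not a surrogate pair"
  else 0x10000 + ((hi - 0xD800) lsl 10) + (lo - 0xDC00)
```

For example, utf16_encode 0x1D11E gives [0xD834; 0xDD1E], and the
largest pair, 0xDBFF 0xDFFF, decodes to 0x10FFFF.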

However, what is customarily called Unicode also has a counterpart
standard, ISO UCS (ISO/IEC 10646).  ISO UCS allows characters to be up
to 31 bits long, and ISO seems to recommend representing every
character as a 32-bit integer.

Clearly, the ISO approach is simpler and allows fast indexing.  On
the other hand, Unicode is more widely used and provides better
algorithms for case mapping, character classification, etc.
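
The indexing difference can be sketched as follows.  Both
representations are modelled as int arrays purely for illustration,
the names nth_ucs4 and nth_utf16 are hypothetical, and the 16-bit
version assumes well-formed input.

```ocaml
(* Sketch: finding the n-th character in each representation.
   Strings are modelled as int arrays; names are illustrative. *)

(* 32-bit representation: one array cell per character, O(1). *)
let nth_ucs4 (s : int array) (n : int) : int = s.(n)

(* 16-bit representation: walk from the start, stepping over
   surrogate pairs, O(n).  Assumes well-formed input. *)
let nth_utf16 (s : int array) (n : int) : int =
  let len = Array.length s in
  let is_high u = u >= 0xD800 && u <= 0xDBFF in
  let rec go i k =
    if i >= len then invalid_arg "nth_utf16: index out of bounds"
    else begin
      let u = s.(i) in
      let paired = is_high u && i + 1 < len in
      if k = 0 then begin
        if paired then
          0x10000 + ((u - 0xD800) lsl 10) + (s.(i + 1) - 0xDC00)
        else u
      end
      else go (if paired then i + 2 else i + 1) (k - 1)
    end
  in
  if n < 0 then invalid_arg "nth_utf16: negative index" else go 0 n
```

With the 16-bit array [|0x41; 0xD834; 0xDD1E; 0x42|], character 1 is
0x1D11E and character 2 is 0x42, but reaching them requires scanning
from the beginning.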

For Caml, in my really humble opinion, the language had better hide
such differences (16 bits or 32 bits), and where they cannot be hidden
(as with case mapping), offer the choice to users.
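
One way such hiding might look is an abstract type behind a module
signature.  This is only a sketch under my own assumptions: USTRING,
Ucs4 and Count are hypothetical names, and nothing like them exists in
the standard library.

```ocaml
(* Hypothetical interface: nothing here exists in the standard library. *)
module type USTRING = sig
  type t                              (* abstract: 16 or 32 bits inside *)
  val of_code_points : int list -> t
  val length : t -> int               (* length in characters *)
  val get : t -> int -> int           (* n-th character as a code point *)
end

(* One possible implementation, with one cell per character. *)
module Ucs4 : USTRING = struct
  type t = int array
  let of_code_points = Array.of_list
  let length = Array.length
  let get = Array.get
end

(* Client code written against USTRING never sees the cell width. *)
module Count (U : USTRING) = struct
  (* Count the characters that lie above U+FFFF. *)
  let count_above_16_bits (s : U.t) : int =
    let n = ref 0 in
    for i = 0 to U.length s - 1 do
      if U.get s i > 0xFFFF then incr n
    done;
    !n
end
```

A 16-bit implementation could be given the same signature, and Count
would work with it unchanged.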

Regards
--
YAMAGATA, yoriyuki (doctoral student)
Department of Mathematical Science, University of Tokyo.