From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rich@annexia.org>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 30F5ABBAF
	for <caml-list@yquem.inria.fr>; Thu, 13 Aug 2009 13:00:23 +0200 (CEST)
Received: from furbychan.cocan.org ([80.68.91.176])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 13 Aug 2009 13:00:22 +0200
Received: from rich by furbychan.cocan.org with local (Exim 4.63)
	(envelope-from <rich@annexia.org>)
	id 1MbY2f-0007Du-1b; Thu, 13 Aug 2009 12:00:21 +0100
Date: Thu, 13 Aug 2009 12:00:21 +0100
To: Dario Teixeira <darioteixeira@yahoo.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Storing UTF-8 in plain strings
Message-ID: <20090813110021.GA27575@annexia.org>
References: <51021.92432.qm@web111506.mail.gq1.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Richard Jones <rich@annexia.org>

On Wed, Aug 12, 2009 at 10:36:56AM -0700, Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.

Also, don't use String.uppercase, String.lowercase,
String.capitalize, String.uncapitalize, Char.uppercase or
Char.lowercase.  These all assume ISO-8859-1, so they will change
bytes in the 0xC0-0xFE range, which in UTF-8 are the lead bytes of
multi-byte sequences.
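
To make the problem concrete, here is a minimal sketch.  Recent OCaml
releases no longer ship the Latin-1 versions, so latin1_uppercase
below is a hypothetical stand-in that imitates their byte mapping, and
ascii_uppercase is an illustrative helper (not a stdlib function) that
only touches a-z and so leaves multi-byte sequences alone.

```ocaml
(* Hypothetical stand-in for the Latin-1 behaviour of the old
   Char.uppercase: a-z, 0xE0-0xF6 and 0xF8-0xFE are shifted down by 32. *)
let latin1_uppercase_char c =
  let n = Char.code c in
  if (n >= 0x61 && n <= 0x7A)
  || (n >= 0xE0 && n <= 0xF6)
  || (n >= 0xF8 && n <= 0xFE)
  then Char.chr (n - 32)
  else c

(* ASCII-only variant: bytes >= 0x80 are never modified, so UTF-8
   sequences pass through untouched. *)
let ascii_uppercase_char c =
  if c >= 'a' && c <= 'z' then Char.chr (Char.code c - 32) else c

let latin1_uppercase s = String.map latin1_uppercase_char s
let ascii_uppercase s = String.map ascii_uppercase_char s

(* U+65E5 (the kanji for "day") is the three bytes E6 97 A5 in UTF-8. *)
let nichi = "\xE6\x97\xA5"

let () =
  (* The lead byte E6 becomes C6, a two-byte lead, which leaves a stray
     continuation byte behind: the result is malformed UTF-8. *)
  assert (latin1_uppercase nichi = "\xC6\x97\xA5");
  (* The ASCII-only version changes the letters and nothing else. *)
  assert (ascii_uppercase ("abc" ^ nichi) = "ABC" ^ nichi)
```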

I've written a number of applications which used UTF-8 extensively,
including one which worked entirely in Japanese, and I've never had a
problem.  Just avoid the bad String/Char functions, and leave anything
that has to understand characters (case conversion, sorting, counting)
to either a database or a module like Ulex/Camomile.  You'll be fine.
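
For a sense of what "UTF-8 aware" means at the byte level, here is a
rough sketch of a character count.  utf8_length is a hypothetical
helper, not part of the stdlib or of Ulex/Camomile, and it assumes the
string has already been validated as UTF-8.

```ocaml
(* Count code points in a string assumed to be valid UTF-8: every byte
   that is not a continuation byte (bit pattern 10xxxxxx) starts a new
   code point. *)
let utf8_length s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* Two kanji (U+65E5 U+672C), three bytes each. *)
  let nihon = "\xE6\x97\xA5\xE6\x9C\xAC" in
  assert (String.length nihon = 6);   (* bytes *)
  assert (utf8_length nihon = 2)      (* code points *)
```

On malformed input this gives a meaningless number, which is one more
reason to let a real library do the work.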

Rich.

-- 
Richard Jones
Red Hat