From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=AWL,SPF_FAIL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 30F5ABBAF for ; Thu, 13 Aug 2009 13:00:23 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AtcDAA6Pg0pQRFuwWWdsb2JhbACaVgEWFQS5E4QZBYFM X-IronPort-AV: E=Sophos;i="4.43,373,1246831200"; d="scan'208";a="32381291" Received: from furbychan.cocan.org ([80.68.91.176]) by mail3-smtp-sop.national.inria.fr with ESMTP; 13 Aug 2009 13:00:22 +0200 Received: from rich by furbychan.cocan.org with local (Exim 4.63) (envelope-from ) id 1MbY2f-0007Du-1b; Thu, 13 Aug 2009 12:00:21 +0100 Date: Thu, 13 Aug 2009 12:00:21 +0100 To: Dario Teixeira Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Storing UTF-8 in plain strings Message-ID: <20090813110021.GA27575@annexia.org> References: <51021.92432.qm@web111506.mail.gq1.yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com> User-Agent: Mutt/1.5.13 (2006-08-11) From: Richard Jones X-Spam: no; 0.00; lowercase:01 uncapitalize:01 lowercase:01 extensively:01 camomile:01 2009:98 char:01 char:01 wrote:01 caml-list:01 functions:01 functions:01 strings:01 strings:01 data:02 On Wed, Aug 12, 2009 at 10:36:56AM -0700, Dario Teixeira wrote: > Hi, > > I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying > on plain strings for processing and storing data. I *think* I can get away > with using only the String module to handle this variable-length encoding > as long as I am careful with the way I treat these strings. Here are the > assumptions I am making: > > - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed. > I can therefore assume in subsequent steps that the source is compliant. > > - It is forbidden to use String.get, String.sub, String.length, or other > functions where awareness of variable-length encoding is required. Needless to say, don't use String.uppercase, String.lowercase, String.capitalize, String.uncapitalize, Char.uppercase or Char.lowercase. These all assume ISO-8859-1. I've written a number of applications which used UTF-8 extensively, including one which worked entirely in Japanese, and I've never had a problem. Just avoid the bad String/Char functions. Use either a database or a module like Ulex/Camomile. You'll be fine. Rich. -- Richard Jones Red Hat