From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.3 required=5.0 tests=SPF_FAIL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id B9E42BBAF for ; Wed, 12 Aug 2009 20:24:16 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AhMCANulgkpQW+UCe2dsb2JhbACaWAEBFiQEuEGEGQWBTA X-IronPort-AV: E=Sophos;i="4.43,369,1246831200"; d="scan'208";a="32357953" Received: from main.gmane.org (HELO ciao.gmane.org) ([80.91.229.2]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES256-SHA; 12 Aug 2009 20:24:16 +0200 Received: from list by ciao.gmane.org with local (Exim 4.43) id 1MbIUe-0000nc-2m for caml-list@inria.fr; Wed, 12 Aug 2009 18:24:12 +0000 Received: from cs-wlc-066.cs.umn.edu ([128.101.33.66]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Aug 2009 18:24:12 +0000 Received: from michael+ocaml by cs-wlc-066.cs.umn.edu with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Aug 2009 18:24:12 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: caml-list@inria.fr From: Michael Ekstrand Subject: Re: Storing UTF-8 in plain strings Date: Wed, 12 Aug 2009 13:24:10 -0500 Message-ID: References: <51021.92432.qm@web111506.mail.gq1.yahoo.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: cs-wlc-066.cs.umn.edu User-Agent: Mozilla-Thunderbird 2.0.0.19 (X11/20090103) In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com> Sender: news X-Spam: no; 0.00; ocaml:01 dependencies:01 sprintf:01 wrote:01 functions:01 strings:01 strings:01 newline:02 data:02 encoding:02 encoding:02 parse:02 string:02 string:02 module:03 Dario Teixeira wrote: > Hi, > > I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying > on plain strings for processing and storing data. I *think* I can get away > with using only the String module to handle this variable-length encoding > as long as I am careful with the way I treat these strings. Here are the > assumptions I am making: > > - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed. > I can therefore assume in subsequent steps that the source is compliant. > > - It is forbidden to use String.get, String.sub, String.length, or other > functions where awareness of variable-length encoding is required. > > - String concatenation is allowed. > > - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a), > because in a multi-byte sequence all bytes have a value > 127. There is > therefore no chance of splitting a multi-byte sequence down the middle. > > > So, can someone find any problems with this reasoning? (Thanks in advance!) It looks good to me. Further, much of the functionality you're forbidding yourself in String is provided by Extlib's UTF8 module without additional dependencies if you do need it at some place in your program. You can also go ahead and use buffers, sprintf, etc., so long as everything is valid UTF-8. They won't care; they'll only see a sequence of bytes. - Michael