From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gclci-caml-list@m.gmane.org>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: *
X-Spam-Status: No, score=1.3 required=5.0 tests=SPF_FAIL autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id B9E42BBAF
	for <caml-list@yquem.inria.fr>; Wed, 12 Aug 2009 20:24:16 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AhMCANulgkpQW+UCe2dsb2JhbACaWAEBFiQEuEGEGQWBTA
X-IronPort-AV: E=Sophos;i="4.43,369,1246831200"; 
   d="scan'208";a="32357953"
Received: from main.gmane.org (HELO ciao.gmane.org) ([80.91.229.2])
  by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/AES256-SHA; 12 Aug 2009 20:24:16 +0200
Received: from list by ciao.gmane.org with local (Exim 4.43)
	id 1MbIUe-0000nc-2m
	for caml-list@inria.fr; Wed, 12 Aug 2009 18:24:12 +0000
Received: from cs-wlc-066.cs.umn.edu ([128.101.33.66])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <caml-list@inria.fr>; Wed, 12 Aug 2009 18:24:12 +0000
Received: from michael+ocaml by cs-wlc-066.cs.umn.edu with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <caml-list@inria.fr>; Wed, 12 Aug 2009 18:24:12 +0000
X-Injected-Via-Gmane: http://gmane.org/
To: caml-list@inria.fr
From: Michael Ekstrand <michael+ocaml@elehack.net>
Subject:  Re: Storing UTF-8 in plain strings
Date:  Wed, 12 Aug 2009 13:24:10 -0500
Message-ID: <h5v1bv$otn$1@ger.gmane.org>
References:  <51021.92432.qm@web111506.mail.gq1.yahoo.com>
Mime-Version:  1.0
Content-Type:  text/plain; charset=UTF-8
Content-Transfer-Encoding:  7bit
X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: cs-wlc-066.cs.umn.edu
User-Agent: Mozilla-Thunderbird 2.0.0.19 (X11/20090103)
In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com>
Sender: news <news@ger.gmane.org>
X-Spam: no; 0.00; ocaml:01 dependencies:01 sprintf:01 wrote:01 functions:01 strings:01 strings:01 newline:02 data:02 encoding:02 encoding:02 parse:02 string:02 string:02 module:03 

Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
> - String concatenation is allowed.
> 
> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
> 
> 
> So, can someone find any problems with this reasoning?  (Thanks in advance!)

It looks good to me.  Further, much of the functionality you're
forbidding yourself in String is provided by Extlib's UTF8 module
without additional dependencies if you do need it at some place in your
program.

You can also go ahead and use buffers, sprintf, etc., so long as
everything is valid UTF-8.  They won't care; they'll only see a sequence
of bytes.

- Michael