From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: ** X-Spam-Status: No, score=2.8 required=5.0 tests=AWL,DNS_FROM_RFC_POST, SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id 406C2BBAF for ; Wed, 12 Aug 2009 22:51:31 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Av8BANfHgkrRVd7Dimdsb2JhbACaGj8BAQEKCQwHEQWmNoEikQABAwIEhBUFgUw X-IronPort-AV: E=Sophos;i="4.43,370,1246831200"; d="scan'208";a="32360980" Received: from mail-pz0-f195.google.com ([209.85.222.195]) by mail3-smtp-sop.national.inria.fr with ESMTP; 12 Aug 2009 22:51:30 +0200 Received: by pzk33 with SMTP id 33so184989pzk.15 for ; Wed, 12 Aug 2009 13:51:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=kbCl08m4/g2e6k0yV4q4C9VY/AE0Fr9H93mMqD93/1E=; b=IvHDQAKFXIG9rgRflnn/sfLCm+9BBQD6/iVBwFMYhV8gUT37i2pmYJOdI9yVVdpHZ2 emxz8howOaaDfNhb4sstoQNnffPr5Ma2PRu9QP6L6TMThefuqdkaVr8q8mzkyMXexYwI yxMG7PrzuqAh2WEsmWiv8MsZDd8cBQnnDyVDc= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; b=KOJZlm7F05BgYMQi40mKTeUAO/aPb+BUTB3p2I9h3t7wWEYFQ9tzWXpjZRphM26gvo 2U5hdnl2RpYEIPQjXtK2wGVK0oITF47cnu2QbLhJ1xwW/aIPsAsOXT1rxBVBrnrqy9Ty V/MHG2F0M9s1T4GMxcLM3Y+AUcq41f/BEW0mU= Received: by 10.115.103.9 with SMTP id f9mr355727wam.227.1250110288929; Wed, 12 Aug 2009 13:51:28 -0700 (PDT) Received: from ?192.168.1.100? ([72.44.99.175]) by mx.google.com with ESMTPS id f20sm13259589waf.52.2009.08.12.13.51.26 (version=TLSv1/SSLv3 cipher=RC4-MD5); Wed, 12 Aug 2009 13:51:27 -0700 (PDT) Message-ID: <4A832B4C.903@gmail.com> Date: Wed, 12 Aug 2009 16:51:24 -0400 From: Edgar Friendly User-Agent: Thunderbird 2.0.0.22 (X11/20090608) MIME-Version: 1.0 To: Dario Teixeira Cc: caml-list@yquem.inria.fr Subject: Re: [Caml-list] Storing UTF-8 in plain strings References: <51021.92432.qm@web111506.mail.gq1.yahoo.com> In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam: no; 0.00; edgar:98 wrote:01 caml-list:01 functions:01 functions:01 strings:01 strings:01 newline:02 data:02 encoding:02 encoding:02 parse:02 parse:02 string:02 string:02 Dario Teixeira wrote: > Hi, > > I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying > on plain strings for processing and storing data. I *think* I can get away > with using only the String module to handle this variable-length encoding > as long as I am careful with the way I treat these strings. Here are the > assumptions I am making: > > - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed. > I can therefore assume in subsequent steps that the source is compliant. > This is the weakest assumption of the four - Ulex could parse and only raise MalFormed on some errors. I'm no expert on Ulex, though... > - It is forbidden to use String.get, String.sub, String.length, or other > functions where awareness of variable-length encoding is required. > Yes, those functions work on bytes, not on characters. > - String concatenation is allowed. > Yes, two valid UTF-8 strings concatenate into another valid UTF-8 string. > - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a), > because in a multi-byte sequence all bytes have a value > 127. There is > therefore no chance of splitting a multi-byte sequence down the middle. > Yes, you can split on low bytes, multibyte characters start with 0b11xx xxxx and continue with 0b10xx xxxx. E