From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.7 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE, DNS_FROM_RFC_POST autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 2E345BBAF for ; Wed, 12 Aug 2009 19:37:32 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AiMCABObgkpDww+jmWdsb2JhbACSM4gkAQEBAQEICwoHE6YkgSKREwEFAwGEFQWBTA X-IronPort-AV: E=Sophos;i="4.43,368,1246831200"; d="scan'208";a="44609742" Received: from web111506.mail.gq1.yahoo.com ([67.195.15.163]) by mail4-smtp-sop.national.inria.fr with SMTP; 12 Aug 2009 19:36:58 +0200 Received: (qmail 92644 invoked by uid 60001); 12 Aug 2009 17:36:57 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s1024; t=1250098617; bh=kp6mgItr5HqfUec45KOAnkrklxR9oB2hwsQCDRsIEQc=; h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=glVxKe00K1D/useV7wwUPOSnKaVp1L+BIatEbypzhDADTFENog0hJcqWeIbhheWli4W0s/A0hqYTbG8m/i0JnXADYpn6XnrZ57YvnySBPSlphisJ5QRNJRXTZh7Hnst3TBboM5/U30jAQ00BOllyG2tetvuPcRf/zNTfseIHnRg= DomainKey-Signature:a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:X-YMail-OSG:Received:X-Mailer:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=b74FejleEtzyDbjdFhxsf7WRF8JraeWoH0+QF9qP9z0Ta0lydo3mhzPrnB51+Dxel5F0A5gWRfBB5SZSGywHjypx8JI9DplRvxqFNa0L053JA6KCUtQ1j7/Mozz/EFQOdpI6N1b/BHV5uWCYzro8KxM0yn0Na0P8Iy5QNX8h24M=; Message-ID: <51021.92432.qm@web111506.mail.gq1.yahoo.com> X-YMail-OSG: DbS5vhoVM1nx4B0LMJYqFkqAM4aDHP9aqtG0JxAKtgW8QsxB14ogjjJ0btZt2xX21n6QjbGhlnJJ8UdwpKMl4H.jhfDNcNsAMXCuQjohS2maRnbLj5.hIYV30BQXfxKup7nNdEPbhRUSFLeqDy33NgEqaNcQwc0mNBWUZ0DAE3nT1XnVrlWL9hAfXKxKKW5syd26MBmusM3AKj0erUhM0v8Ye4uJSHpCZxQtHaxikOORQYmq9yHoboIOxLBUSWtcVn5cbkH6Qqg3vwXOYTBMXb5dxlvUAhCT6T3tirZX9.FNKrHbTJa0Yo2mfx4yAB47ZxU7LKA- Received: from [213.205.71.49] by web111506.mail.gq1.yahoo.com via HTTP; Wed, 12 Aug 2009 10:36:56 PDT X-Mailer: YahooMailClassic/6.1.2 YahooMailWebService/0.7.338.1 Date: Wed, 12 Aug 2009 10:36:56 -0700 (PDT) From: Dario Teixeira Subject: Storing UTF-8 in plain strings To: caml-list@yquem.inria.fr MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam: no; 0.00; functions:01 strings:01 strings:01 newline:02 data:02 encoding:02 encoding:02 parse:02 string:02 string:02 module:03 module:03 bytes:03 raise:03 separator:04 Hi,=0A=0AI'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'= m relying=0Aon plain strings for processing and storing data. I *think* I = can get away=0Awith using only the String module to handle this variable-le= ngth encoding=0Aas long as I am careful with the way I treat these strings.= Here are the=0Aassumptions I am making:=0A=0A- If the source is invalid U= TF-8 in any way, Ulex will raise Utf8.MalFormed.=0A I can therefore assume= in subsequent steps that the source is compliant.=0A=0A- It is forbidden t= o use String.get, String.sub, String.length, or other=0A functions where a= wareness of variable-length encoding is required.=0A=0A- String concatenati= on is allowed.=0A=0A- Using Extlib's String.nsplit is okay if the separator= is a newline (0x0a),=0A because in a multi-byte sequence all bytes have a= value > 127. There is=0A therefore no chance of splitting a multi-byte s= equence down the middle.=0A=0A=0ASo, can someone find any problems with thi= s reasoning? (Thanks in advance!)=0A=0ABest regards,=0ADario Teixeira=0A= =0AP.S. And yes, I am aware that there are excellent libraries for handling= =0A UTF-8 (like the Rope module in Batteries).=0A=0A=0A=0A