From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: ** X-Spam-Status: No, score=2.8 required=5.0 tests=AWL,DNS_FROM_RFC_POST, MAILTO_TO_SPAM_ADDR,SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id 03A0EBBAF for ; Wed, 12 Aug 2009 23:34:18 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqQCAL/RgkrRVd7Dimdsb2JhbACaHT8BAQEKCQwHEQWnPJB9AQMCBII8gVkFgUyGfA X-IronPort-AV: E=Sophos;i="4.43,370,1246831200"; d="scan'208";a="34337478" Received: from mail-pz0-f195.google.com ([209.85.222.195]) by mail1-smtp-roc.national.inria.fr with ESMTP; 12 Aug 2009 23:34:17 +0200 Received: by pzk33 with SMTP id 33so199078pzk.15 for ; Wed, 12 Aug 2009 14:34:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:from:date:x-google-sender-auth:message-id:subject:to:cc :content-type:content-transfer-encoding; bh=NiHGyWXT8y/M+IwwaIJFalVxXNpfQPgGwLHBzKHK97U=; b=Xpd/slnX+EJM+Q+EaZNDMrfXRdwb/kUZK8F+HsW10Iqxy9XVWRSQ5Rb9kDWEiRBRh+ 5YZoMDiKRpdV4A9mAxKj/491bU59BiGRSlbm1KqvXMmpf4k7Y/CeIvt1Z/j9LdpphWkP i/AF8SHsn4apUZMZM537P70H80EE6OzRWan5Q= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:cc:content-type :content-transfer-encoding; b=lLTINtCPe6X4oDk6+bsF8k4c+8g3EvzpmSANYwWOkPJR3956+j4JL/Ycd7gIX7e1nM CJFP2Rf1c4BM57fChD1/A8MmVkGsDdOe4uUT/MQSc60RJ3sinlmZ5B/t0udNQptNRVu3 nr9pVcFk+lyOMuB5kRVGala1WFaxJF4jP8foU= MIME-Version: 1.0 Sender: jake.donham@gmail.com Received: by 10.142.179.1 with SMTP id b1mr84275wff.295.1250112856079; Wed, 12 Aug 2009 14:34:16 -0700 (PDT) In-Reply-To: <4A832B4C.903@gmail.com> References: <51021.92432.qm@web111506.mail.gq1.yahoo.com> <4A832B4C.903@gmail.com> From: Jake Donham Date: Wed, 12 Aug 2009 14:33:56 -0700 X-Google-Sender-Auth: 187a105027c1ba0f Message-ID: Subject: Re: [Caml-list] Storing UTF-8 in plain strings To: Edgar Friendly Cc: Dario Teixeira , caml-list@yquem.inria.fr Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Spam: no; 0.00; ocamlnet:01 ocaml:01 2009:98 edgar:98 wrote:01 caml-list:01 strings:01 strings:01 parse:02 equivalents:02 string:02 string:02 module:03 raise:03 raise:03 On Wed, Aug 12, 2009 at 1:51 PM, Edgar Friendly wrote= : >> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFor= med. >> =A0 I can therefore assume in subsequent steps that the source is compli= ant. >> > This is the weakest assumption of the four - Ulex could parse and only > raise MalFormed on some errors. =A0I'm no expert on Ulex, though... The original poster might be interested in the Netconversion module from Ocamlnet, which is designed to work with UTF-8 stored in OCaml strings. In particular it has a function Netconversion.verify `Enc_utf8 s which checks that s is a valid UTF-8 string. It also has equivalents for String.{get,sub,length}. Jake