Date: Wed, 10 Oct 2007 10:05:11 +0200
From: "Loup Vaillant"
To: "Vincent Hanquez"
Cc: "Jon Harrop", caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Rope is the new string

2007/10/10, Vincent Hanquez:
> On Tue, Oct 09, 2007 at 11:06:59PM +0200, Loup Vaillant wrote:
> > Err, I would prefer this:
> >
> > (* big snipet *)
>
> yes I agree, that what I had in mind, but didn't want to clutter my example.
> internally the ustring can hold the format. it should look something like:
>
> type unicode_type = UTF8 | UTF16 | UTF32 | Latin | .............
> type ustring = unicode_type * string
>
> of_string : string -> unicode_type -> ustring (* raise if not correct type *)
>
> which is the same as what you describe, except that you have one parsing
> function for every type ;)

That's even better.

> > print
> >   (append
> >     scan_Latin1
> >     (of_string text))
> > (* this is not Lisp *)
> >
> > Just that the sample code you wrote suggest you could have different
> > types of unicode strings.
> > I want only one type, so I don't mind the encoding, except when
> > reading and printing (to files and native strings). To mind even
> > less, you could have a general scan and from_string functions which
> > guess which encoding is used (not very safe, but cool)
>
> or have a autodetect format in the of_string format wanted along with
> the other encoding ;)

But of course.

> > > that way when I'm manipulating unicode string, i won't try to append a
> > > binary string to a unicode string. I can code safely with my unicode
> > > string (whatever the format utf-{8..32}), and certainly expect the type
> > > system to complain loudly when doing something that might break unicode.
> >
> > The exception system can do that, but how could the type system?
> > (Would be better if it could.)
>
> well it does now if you define ustring as an opaque type. you do your
> parsing at one place (from string to ustring), at this place it can
> raise exception if not a proper format. but once you're manipulating
> ustring, it's safe to do whatever you want with them.

OK.

So, it seems we agree more or less on the interface. Now what about the
implementation? Ropes? Flat?

I like ropes, personally: concatenation is fast, and lookups are still
sub-linear. In general, ropes look cool for functional strings. Last but
not least, they were almost mandatory for saving Endo at the ICFP contest
this year. ;-)

Loup Vaillant
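
A minimal sketch of the interface discussed in this thread, to make the
"opaque type plus rope" idea concrete. It is an illustration under stated
assumptions, not code from either poster: the names Ustring, encoding,
Malformed, get and to_utf8 are invented here, only Latin-1 and UTF-8 input
are handled, the UTF-8 decoder does not reject overlong forms, and the rope
is never rebalanced (so lookups are only sub-linear while the tree stays
reasonably balanced). It assumes OCaml 4.06 or later for
Buffer.add_utf_8_uchar.

```ocaml
module Ustring : sig
  type t                              (* opaque: built only via of_string *)
  type encoding = Latin1 | Utf8       (* hypothetical subset of encodings *)
  exception Malformed of string
  val of_string : string -> encoding -> t   (* raises Malformed *)
  val append : t -> t -> t
  val length : t -> int               (* number of code points *)
  val get : t -> int -> int           (* code point at a given index *)
  val to_utf8 : t -> string
end = struct
  type encoding = Latin1 | Utf8
  exception Malformed of string

  (* Rope of code points; each node caches the length of its subtree. *)
  type t =
    | Leaf of int array
    | Node of t * t * int

  let length = function Leaf a -> Array.length a | Node (_, _, n) -> n

  (* Concatenation allocates one node, whatever the operand sizes. *)
  let append a b =
    if length a = 0 then b
    else if length b = 0 then a
    else Node (a, b, length a + length b)

  (* Lookup descends one branch per level, using the cached lengths. *)
  let rec get t i =
    match t with
    | Leaf a -> a.(i)                 (* Invalid_argument if out of bounds *)
    | Node (l, r, _) ->
      let ll = length l in
      if i < ll then get l i else get r (i - ll)

  let decode_utf8 s =
    let n = String.length s in
    (* Payload bits of the continuation byte at index i. *)
    let cont i =
      if i >= n then raise (Malformed "truncated sequence");
      let c = Char.code s.[i] in
      if c land 0xC0 <> 0x80 then raise (Malformed "bad continuation byte");
      c land 0x3F
    in
    let rec go i acc =
      if i >= n then List.rev acc
      else begin
        let c = Char.code s.[i] in
        let cp, len =
          if c < 0x80 then (c, 1)
          else if c land 0xE0 = 0xC0 then
            (((c land 0x1F) lsl 6) lor cont (i + 1), 2)
          else if c land 0xF0 = 0xE0 then
            (((c land 0x0F) lsl 12) lor (cont (i + 1) lsl 6)
             lor cont (i + 2), 3)
          else if c land 0xF8 = 0xF0 then
            (((c land 0x07) lsl 18) lor (cont (i + 1) lsl 12)
             lor (cont (i + 2) lsl 6) lor cont (i + 3), 4)
          else raise (Malformed "bad lead byte")
        in
        (* Rejects surrogates and values above U+10FFFF. *)
        if not (Uchar.is_valid cp) then raise (Malformed "invalid code point");
        go (i + len) (cp :: acc)
      end
    in
    Array.of_list (go 0 [])

  (* The single place where raw bytes are parsed and checked. *)
  let of_string s enc =
    match enc with
    | Latin1 -> Leaf (Array.init (String.length s) (fun i -> Char.code s.[i]))
    | Utf8 -> Leaf (decode_utf8 s)

  let rec iter f = function
    | Leaf a -> Array.iter f a
    | Node (l, r, _) -> iter f l; iter f r

  let to_utf8 t =
    let b = Buffer.create (length t) in
    iter (fun cp -> Buffer.add_utf_8_uchar b (Uchar.of_int cp)) t;
    Buffer.contents b
end

(* Usage: mix two input encodings, get one string type out. *)
let () =
  let latin = Ustring.of_string "caf\xE9" Ustring.Latin1 in
  let utf8 = Ustring.of_string " \xE2\x82\xAC" Ustring.Utf8 in
  let u = Ustring.append latin utf8 in
  Printf.printf "%d code points: %s\n" (Ustring.length u) (Ustring.to_utf8 u)
```

Because Ustring.t is abstract outside the module, an expression such as
Ustring.append u "raw bytes" is rejected at compile time, which is the
"type system complains loudly" behaviour asked for in the thread. Encoding
errors can only surface as the Malformed exception, and only inside
of_string.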