From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <loup.vaillant@gmail.com>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: *
X-Spam-Status: No, score=1.0 required=5.0 tests=AWL,SPF_NEUTRAL 
	autolearn=disabled version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105])
	by yquem.inria.fr (Postfix) with ESMTP id 0B825BC6B
	for <caml-list@yquem.inria.fr>; Tue,  9 Oct 2007 23:07:05 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AgAAAI6KC0fRVZKykmdsb2JhbACOSAIBAQcEBBMW
X-IronPort-AV: E=Sophos;i="4.21,250,1188770400"; 
   d="scan'208";a="17770948"
Received: from wa-out-1112.google.com ([209.85.146.178])
  by mail4-smtp-sop.national.inria.fr with ESMTP; 09 Oct 2007 23:07:01 +0200
Received: by wa-out-1112.google.com with SMTP id k17so2247304waf
        for <caml-list@yquem.inria.fr>; Tue, 09 Oct 2007 14:07:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=beta;
        h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
        bh=rVDB69L/Pxy517zHSWpoPZvY3EVfI/kOgZK0UwAr1UU=;
        b=YDt+adggfvFMaabeTXV8ckd9OaZ5ZiHFutLbqV4QIUHAA15Y9G3HxJ+D+TT5MnOH5B6vHXEGkeMQbNXkkkhkcFQEGVBdwgBOa2TXP/3d+2R0F2N45biiV0LHe8is7SvPtP/zrjrsZ8LqwXEs2QLuPduW8v1uB9tu4YLg3g7facY=
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=beta;
        h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
        b=Lz+oX8G6wzgehzeCGDURPXJrOpdspHa+b9dJEkws+ACqwFZUxJt5Tc7vcMHrOQ78Vr6RjiGBmtNc71sPVBYxxTw9VkBSEC9xJO2+x7x2TOmnP42tFkYu8bp5B2cBzw+4zpYhI9YYEgmms/lMxaMzfZqp2m0og8Cdpr5gp3ykC0w=
Received: by 10.114.174.2 with SMTP id w2mr94134wae.1191964019860;
        Tue, 09 Oct 2007 14:06:59 -0700 (PDT)
Received: by 10.114.13.16 with HTTP; Tue, 9 Oct 2007 14:06:59 -0700 (PDT)
Message-ID: <6f9f8f4a0710091406u445e674ar5a39f66ab7d8ba38@mail.gmail.com>
Date: Tue, 9 Oct 2007 23:06:59 +0200
From: "Loup Vaillant" <loup.vaillant@gmail.com>
To: "Vincent Hanquez" <tab@snarc.org>
Subject: Re: [Caml-list] Re: Rope is the new string
Cc: "Jon Harrop" <jon@ffconsultancy.com>, caml-list@yquem.inria.fr
In-Reply-To: <20071009195119.GA29263@snarc.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <1191879429.28011.27.camel@rosella.wigram>
	 <1191886669.26491.29.camel@rosella.wigram>
	 <20071009.122004.251675959.Christophe.Troestler+ocaml@umh.ac.be>
	 <200710091440.48381.jon@ffconsultancy.com>
	 <20071009155702.GB26282@snarc.org>
	 <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com>
	 <20071009165520.GA27507@snarc.org>
	 <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com>
	 <20071009195119.GA29263@snarc.org>
X-Spam: no; 0.00; 0200,:01 appending:01 ocaml:01 byte:01 arrays:01 ocaml:01 byte:01 subset:01 encodings:01 wrote:01 wrote:01 partial:01 abstract:01 syntactic:01 exception:01 

2007/10/9, Vincent Hanquez <tab@snarc.org>:
> On Tue, Oct 09, 2007 at 07:32:25PM +0200, Loup Vaillant wrote:
> > > definitely we also need some UTFstring type library (which can use rope,
> > > string, whatever internally), with all common type of operations
> > > (appending, finding, ...), but it's a just a specific sub case and also
> > > a different type not compatible with strings (in OCaml terminology).
> >
> > Then, we should have both byte arrays (the native Ocaml strings), and
> > unicode strings. We will also need proper syntactic sugar for unicode
> > strings. Operators, and literal values (like #"example"). Only then,
> > ropes could feel like native strings --and be useful as such.
>
> not sure If i see your point here, since your are mixing rope and
> unicode. however I think we are missing some other type of string
> implementation (maybe rope) *along* the current implementation of
> string.

Sorry. I just saw ropes as a possible implementation for unicode
strings (because there is no standard yet, we have the choice). By the
way, there is a good chance they are better than flat string for most
purposes.

> while we also miss unicode support somehow integrated, what
> implementation of the underlaying basic byte string is used, is
> irrevelant.

I agree.

> > > [...] it's a just a specific sub case [...]
> >
> > Internationalization is, mere text crunching is not. (You meant that,
> > right?) With properly interfaced unicode strings, I can do my text
> > crunching without worrying about internationalization, and with no
> > programming overhead. Then, when (if) I have to internationalize, it
> > is much easier.
>
> Absolutely. What I meant basicly resume into, that unicode strings are
> just a subset of strings (as array of bytes). you can store a unicode
> string in a byte string, whereas you can't store a byte string into a
> unicode string.
>
> i want a UTF library to be able to do something like:
>
> type ustring = unicode_type * string
> of_string: string -> ustring (* raise if not unicode compliant *)
> to_string: ustring -> string
> append: ustring -> ustring -> ustring
> ...etc

Err, I would prefer this:

type ustring (* the type should be abstract, I think *)
of_string_utf8: string -> ustring (* raise if not utf8 compliant *)
to_string_utf8: ustring -> string
of_string_utf16: string -> ustring (* raise if not utf16 compliant *)
to_string_utf16: ustring -> string
of_string_latin1: string -> ustring (* raise if not latin1 compliant *)
to_string_latin1: ustring -> string (* raise if characters not encoded
in Latin1 (the exeption should contain a usefull result) *)
(* etc for each encoding *)
scan_utf8 (* raise if not utf8 compliant (but do not lose a possible
partial result) *)
print_utf8
(* etc for each encoding *)

append: ustring -> ustring -> ustring
(* etc *) (* I agree on these ones *)

So I can mix-up different encodings:

print
  (append
    scan_Latin1
    (of_string text))
(* this is not Lisp *)

Just that the sample code you wrote suggest you could have different
types of unicode strings. I want only one type, so I don't mind the
encoding, except when reading and printing (to files and native
strings). To mind even less, you could have a general scan and
from_string functions which guess which encoding is used (not very
safe, but cool)

> that way when I'm manipulating unicode string, i won't try to append a
> binary string to a unicode string. I can code safely with my unicode
> string (whatever the format utf-{8..32}), and certainly expect the type
> system to complain loudly when doing something that might break unicode.

The exception system can do that, but how could the type system?
(Would be better if it could.)

Loup Vaillant