From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tab@snarc.org>
X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.1.3
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by yquem.inria.fr (Postfix) with ESMTP id 77A00BC6B
	for <caml-list@yquem.inria.fr>; Wed, 10 Oct 2007 09:35:28 +0200 (CEST)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ao8CAKUdDEfUVZgL/2dsb2JhbAA
X-IronPort-AV: E=Sophos;i="4.21,252,1188770400"; 
   d="scan'208";a="4296303"
Received: from hades.snarc.org ([212.85.152.11])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 10 Oct 2007 09:35:28 +0200
Received: by hades.snarc.org (Postfix, from userid 1000)
	id B86021B4D8; Wed, 10 Oct 2007 09:35:26 +0200 (CEST)
Date: Wed, 10 Oct 2007 09:35:26 +0200
To: Loup Vaillant <loup.vaillant@gmail.com>
Cc: Jon Harrop <jon@ffconsultancy.com>, caml-list@yquem.inria.fr
Subject: Re: [Caml-list] Re: Rope is the new string
Message-ID: <20071010073526.GA21648@snarc.org>
References: <1191879429.28011.27.camel@rosella.wigram> <1191886669.26491.29.camel@rosella.wigram> <20071009.122004.251675959.Christophe.Troestler+ocaml@umh.ac.be> <200710091440.48381.jon@ffconsultancy.com> <20071009155702.GB26282@snarc.org> <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com> <20071009165520.GA27507@snarc.org> <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com> <20071009195119.GA29263@snarc.org> <6f9f8f4a0710091406u445e674ar5a39f66ab7d8ba38@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <6f9f8f4a0710091406u445e674ar5a39f66ab7d8ba38@mail.gmail.com>
X-Warning: Email may contain unsmilyfied humor and/or satire.
User-Agent: Mutt/1.5.16 (2007-06-11)
From: tab@snarc.org (Vincent Hanquez)
X-Spam: no; 0.00; 0200,:01 encodings:01 wrote:01 wrote:01 partial:01 abstract:01 parsing:01 parsing:01 exception:01 exception:01 opaque:01 caml-list:01 functions:01 strings:01 strings:01 

On Tue, Oct 09, 2007 at 11:06:59PM +0200, Loup Vaillant wrote:
> Err, I would prefer this:
> 
> type ustring (* the type should be abstract, I think *)
> of_string_utf8: string -> ustring (* raise if not utf8 compliant *)
> to_string_utf8: ustring -> string
> of_string_utf16: string -> ustring (* raise if not utf16 compliant *)
> to_string_utf16: ustring -> string
> of_string_latin1: string -> ustring (* raise if not latin1 compliant *)
> to_string_latin1: ustring -> string (* raise if characters not encoded
> in Latin1 (the exeption should contain a usefull result) *)
> (* etc for each encoding *)
> scan_utf8 (* raise if not utf8 compliant (but do not lose a possible
> partial result) *)
> print_utf8
> (* etc for each encoding *)

yes I agree, that what I had in mind, but didn't want to clutter my example.
internally the ustring can hold the format. it should look something like:

type unicode_type = UTF8 | UTF16 | UTF32 | Latin | .............
type ustring = unicode_type * string

of_string : string -> unicode_type -> ustring (* raise if not correct type *)

which is the same as what you describe, except that you have one parsing
function for every type ;)

> append: ustring -> ustring -> ustring
> (* etc *) (* I agree on these ones *)
> 
> So I can mix-up different encodings:

yes it would be ideal.

> print
>   (append
>     scan_Latin1
>     (of_string text))
> (* this is not Lisp *)
> 
> Just that the sample code you wrote suggest you could have different
> types of unicode strings. I want only one type, so I don't mind the
> encoding, except when reading and printing (to files and native
> strings). To mind even less, you could have a general scan and
> from_string functions which guess which encoding is used (not very
> safe, but cool)

or have a autodetect format in the of_string format wanted along with
the other encoding ;)

> > that way when I'm manipulating unicode string, i won't try to append a
> > binary string to a unicode string. I can code safely with my unicode
> > string (whatever the format utf-{8..32}), and certainly expect the type
> > system to complain loudly when doing something that might break unicode.
> 
> The exception system can do that, but how could the type system?
> (Would be better if it could.)

well it does now if you define ustring as an opaque type. you do your
parsing at one place (from string to ustring), at this place it can
raise exception if not a proper format. but once you're manipulating
ustring, it's safe to do whatever you want with them.

-- 
Vincent Hanquez