From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gabriel.scherer@gmail.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by sympa.inria.fr (Postfix) with ESMTPS id 145CB7F057
	for <caml-list@sympa.inria.fr>; Mon,  6 Oct 2014 13:09:41 +0200 (CEST)
Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender
  authenticity information available from domain of
  gabriel.scherer@gmail.com) identity=pra;
  client-ip=209.85.214.181;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="gabriel.scherer@gmail.com";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of
  gabriel.scherer@gmail.com designates 209.85.214.181 as
  permitted sender) identity=mailfrom;
  client-ip=209.85.214.181;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="gabriel.scherer@gmail.com";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@mail-ob0-f181.google.com) identity=helo;
  client-ip=209.85.214.181;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="postmaster@mail-ob0-f181.google.com";
  x-conformance=sidf_compatible
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: ArUBAG13MlTRVda1m2dsb2JhbABfDoNTWASCfskbhnlUAn0IFgERAQEBAQEGCwsJFCyEBAEBBBIRBBkBGx0BAwwGBQsNAgImAgIiAREBBQEcBhMbB4gHAQMRDZ86boswgXKDEIdBChknDWeGMRIBBQ6BHokdhRQ1MweCd4FUBZJ6gzaHDoEtO4YxiWiCDRgpgxsggRlCOy8BgkkBAQE
X-IPAS-Result: ArUBAG13MlTRVda1m2dsb2JhbABfDoNTWASCfskbhnlUAn0IFgERAQEBAQEGCwsJFCyEBAEBBBIRBBkBGx0BAwwGBQsNAgImAgIiAREBBQEcBhMbB4gHAQMRDZ86boswgXKDEIdBChknDWeGMRIBBQ6BHokdhRQ1MweCd4FUBZJ6gzaHDoEtO4YxiWiCDRgpgxsggRlCOy8BgkkBAQE
X-IronPort-AV: E=Sophos;i="5.04,663,1406584800"; 
   d="scan'208";a="99584499"
Received: from mail-ob0-f181.google.com ([209.85.214.181])
  by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 06 Oct 2014 13:09:39 +0200
Received: by mail-ob0-f181.google.com with SMTP id wm4so3666142obc.40
        for <caml-list@inria.fr>; Mon, 06 Oct 2014 04:09:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :cc:content-type:content-transfer-encoding;
        bh=TKaW0X/AGqQltr9z9KudpQTrJIA7ok9/E7p2XbiFftU=;
        b=OXiifziRJEjOxJm6vXYODl7bIoqb4ILVaV4xmk3TXz8suUlKYF93fDKtm+VLYyGIcE
         omJdqvsSzRy+HjkUEaJjYbMa96AcYfLgxX/XihlkhmKtfRX+C3o0Xd5cHdp0xxPUvZ1G
         PJ7FL+W8wDdk2L/XpuJ2rvJL63cY12WxSUcTtpKuud2t4NMxxggF2VZ3qRKl3NdzRxaE
         G9jwp3cSdL0UNbhEQ7tXTfwXkQNPhExGaOENP91bFvTCN7vRpfMyY6IWkH3np020rABB
         Fsub9zIyGIWVwsrtaWWEI4qh2PZDPOCApES4E+rB1rj4kq5cay9TXZEsTWCcJxzZ/5xe
         Z1QA==
X-Received: by 10.60.45.7 with SMTP id i7mr26964121oem.2.1412593778887; Mon,
 06 Oct 2014 04:09:38 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.76.105.196 with HTTP; Mon, 6 Oct 2014 04:08:58 -0700 (PDT)
In-Reply-To: <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp>
References: <CAPFanBEXe1rVifeL2fYH-hbcLXJMf3zWjqP3-K1Zhhh7L7p61Q@mail.gmail.com>
 <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp>
From: Gabriel Scherer <gabriel.scherer@gmail.com>
Date: Mon, 6 Oct 2014 13:08:58 +0200
Message-ID: <CAPFanBGkJEGFkXzLh=fnhHPEFhX4m3je+Xg12g0Rp_YBUyGihg@mail.gmail.com>
To: Jacques Garrigue <garrigue@math.nagoya-u.ac.jp>
Cc: OCaML List Mailing <caml-list@inria.fr>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Caml-list] Feedback on -safe-string migration attempts

I'm strongly convinced that allowing a difference in representation is
the right choice. I'm not actively pushing for using another
representation, but in my experience thinking of them as distinct is
a very good thought-experiment to design better interfaces.

I would compare this to the idea that "data created at an immutable
type might be allocated in read-only memory": I'm not pushing for this
to be implemented, but it's an excellent thought-experiment to explain
why certains use of Obj.magic are assuredly wrong.

(In my experience so far working with the stdlib, Extlib and
Batteries, I have not felt that the distinct representations were
painful. In particular, I needed Bytes.get rarely enough that the lack
of s.[n] syntax was never an issue.)


I would thus rather suggest the following interfaces:
  type 'a bytes (* =3D string  under -unsafe-string *)
  type string

An interesting interface change would be the conversion functions. We
currently have:

  Bytes.of_string : string -> bytes
  Bytes.to_string : bytes -> string

  Bytes.copy : bytes -> bytes

  (* see Bytes documentation *)
  Bytes.unsafe_of_string : string -> bytes
  Bytes.unsafe_to_string : bytes -> string

We could instead have something like the following:

  Bytes.immut_of_string : string -> immut bytes
  Bytes.immut_to_string : immut bytes -> string

  Bytes.copy : 'a bytes -> 'b bytes

  (* same usage restrinction than Bytes.unsafe_{of,to}_string,
     see the documentation *)
  Bytes.unsafe_mut : immut bytes -> mut bytes
  Bytes.unsafe_immut : bytes bytes -> immut bytes

with the aliases

  Bytes.of_string : string -> 'a bytes
  of_string s =3D copy (immut_of_string s)

  Bytes.to_string : 'a bytes -> string
  to_string s =3D immut_to_string (copy s)

  Bytes.unsafe_to_string : mut bytes -> string
  unsafe_to_string s =3D immut_to_string (unsafe_immut s)

(unsafe_of_string could go away: it can only be correctly used if the
resulting bytes is used immutably, so it is superseded by the safe
immut_of_string function.)

On Mon, Oct 6, 2014 at 4:11 AM, Jacques Garrigue
<garrigue@math.nagoya-u.ac.jp> wrote:
> Hi Gabriel,
>
> I think this is an interesting proposal.
> We didn=E2=80=99t consider it when adding the -safe-string option, but I =
do
> not think it is too late to change things in the compiler:
> keeping compatibility with pre-4.02 code is essential, but Bytes
> itself is still experimental, so changing the bytes type should be ok.
>
> Actually, I think no decision was reached whether the ability to have
> different internal representations for string and bytes is really importa=
nt.
> This may matter when you use javascript as backend, but otherwise?
> If we decided to keep the same representation by default, then a
> reasonable approach would be to adopt your proposal with the following
> extra twist:
>
> * in safe-string mode, string is an alias for immut bytes
>         type 'a bytes
>         type string =3D immut bytes
> * in legacy mode, bytes is an alias for string
>         type string
>         type 'a bytes =3D string
>
> To keep good compatibility, the functions in the String module would
> only have monomorphic types.
> The notation "s.[n]" is a subtle case, but a solution could be to have it
> expanded to the monomorphic String.get in legacy mode, and to the polymor=
phic
> Bytes.get in safe-string mode. This would allow to keep the "s.[n] <- e=
=E2=80=9D
> notation too.
>
> I think this would be more comfortable to use than the current state.
>
>         Jacques
>
> On 2014/10/06 02:19, Gabriel Scherer wrote:
>>
>> Hi list,
>>
>> I recently converted Extlib to work with safe-string ( the patch can
>> be found in the ocaml-lib-devel archives,
>> http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
>> while it mostly went smoothly, there was a pain point that I think
>> would be worth discussing.
>>
>> The question is, when converting an existing library interface, how to
>> decide whether any given part of the API should remain a "string" or
>> be moved to "bytes" (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
>> maybe provide two functions, one for each type.
>>
>> # The problem
>>
>> The new distinction between bytes and string, added in 4.02, actually
>> plays on two different intuitions:
>> - bytes represents (1) mutable (2) sequences of bytes
>> - string represents (1) immutable (2) end-user text (which happen to
>> be represented as sequence of bytes, but we could think of
>> representing them as eg. Javascript strings in the future and with
>> js_of_ocaml, or with ropes, etc.)
>>
>> The problem is that aspects (1) and (2) are somewhat orthogonal. I
>> don't think we're interested in mutable end-user texts, but I
>> encountered a few notable cases of (1) immutable (2) sequences of
>> bytes. The problem is: should those be typed as string, or bytes?
>>
>> (There may be a difference between functions that assume their
>> arguments are immutable, and function that simply guarantee that they
>> won't themselves mutate their arguments. For now I'll assume those two
>> cases count as "immutable sequences of bytes").
>>
>> Right now, the standard library itself does a strange job of making a
>> choice. The Marshal module (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
>> appears to favor the choice of "bytes" for non-mutated byte sequences
>> (eg. data_size, total_size), while the Digest module (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
>> remained in the land of strings.
>>
>>
>> # An ideal solution
>>
>> In an ideal world, I claim the best solution would be the following.
>> Given that it is clear (to me) that mutable byte sequences and
>> immutable byte sequences share the same representation, we should use
>> phantom type to distinguish them:
>>
>>  type mut
>>  type immut
>>  type 'a bytes
>>
>>  val get : 'a bytes -> int -> char
>>  val set : mut bytes -> int -> char
>>  Digest.t =3D immut bytes
>>
>> Using phantom types had been considered at the time of the
>> bytes/string split, but rejected because suddenly adding polymorphism
>> to string literals and string functions broke a lot of code ("The type
>> of this expression, ..., contains a type variable that cannot be
>> generalized", or suddenly-polymorphic method return types). More
>> importantly, we do not want to enforce string and bytes to always have
>> the same underlying representation. Neither arguments hold for
>> mutable/immutable bytes.
>>
>>
>> # Going forward
>>
>> It is probably a bit too late to change the "bytes" type in the
>> compiler standard library. (Well, feel free to disagree on this.)
>> And maybe we don't need to: just as more featureful, higher-level
>> libraries have been developed outside the OCaml distribution, we could
>> think of having a safer, higher-level phantom representation of byte
>> sequences, as an external library.
>>
>> Regardless of what we do about this, I would recommend that immutable
>> byte sequences (things that are, by design, not text) be represented
>> as "bytes" rather than "string"=C2=B9. If/whenever a consensus on a safer
>> phantom representation appear, it will be possible to convert to it
>> without changing the representation.
>> Similarly, if your bytes-taking function does not mutate or capture
>> its input, you should mention it informally in its
>> specification/documentation (and maybe express this with a phantom
>> type later): this is important to reason about, for example, (un)safe
>> conversions on those byte sequences.
>>
>> =C2=B9: a dissenting opinion could suggest that it is more important to =
get
>> the type-checker help re. mutability than expose the distinction
>> between byte-level data and text (which should be an abstract type in
>> some UTF8 library anyway), and thus immutable anything should rather
>> be "string". I think the phantom type approach is superior, and we
>> should design interfaces with it in mind.
>
>
>