From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gabriel.scherer@gmail.com>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by sympa.inria.fr (Postfix) with ESMTPS id CC6B57F02D
	for <caml-list@sympa.inria.fr>; Sun,  5 Oct 2014 19:19:44 +0200 (CEST)
Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender
  authenticity information available from domain of
  gabriel.scherer@gmail.com) identity=pra;
  client-ip=209.85.214.170;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="gabriel.scherer@gmail.com";
  x-conformance=sidf_compatible
Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of
  gabriel.scherer@gmail.com designates 209.85.214.170 as
  permitted sender) identity=mailfrom;
  client-ip=209.85.214.170;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="gabriel.scherer@gmail.com";
  x-conformance=sidf_compatible; x-record-type="v=spf1"
Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@mail-ob0-f170.google.com) identity=helo;
  client-ip=209.85.214.170;
  receiver=mail2-smtp-roc.national.inria.fr;
  envelope-from="gabriel.scherer@gmail.com";
  x-sender="postmaster@mail-ob0-f170.google.com";
  x-conformance=sidf_compatible
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AosBAJ58MVTRVdaqlWdsb2JhbABeg2FYBIJ+yQyGeYFNCBYBEQEBAQEHDQkJEi6EHBEEGQEbHgMSEA8CJgIkAREBBQEiLgeIBwEDEQ2af4MdboswgXKDEIdgChknDWeGQwEFDoEekheBVAWSeoM2hw6BLTuQGYINGCmDGyCBWzsvAYJJAQEB
X-IPAS-Result: AosBAJ58MVTRVdaqlWdsb2JhbABeg2FYBIJ+yQyGeYFNCBYBEQEBAQEHDQkJEi6EHBEEGQEbHgMSEA8CJgIkAREBBQEiLgeIBwEDEQ2af4MdboswgXKDEIdgChknDWeGQwEFDoEekheBVAWSeoM2hw6BLTuQGYINGCmDGyCBWzsvAYJJAQEB
X-IronPort-AV: E=Sophos;i="5.04,660,1406584800"; 
   d="scan'208";a="99480324"
Received: from mail-ob0-f170.google.com ([209.85.214.170])
  by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 05 Oct 2014 19:19:44 +0200
Received: by mail-ob0-f170.google.com with SMTP id uz6so2970001obc.1
        for <caml-list@inria.fr>; Sun, 05 Oct 2014 10:19:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:from:date:message-id:subject:to:content-type
         :content-transfer-encoding;
        bh=2H1EUdTb1Cg83hIkvNVehPd1yQpUcQUUQS/w3Mfva4M=;
        b=T83j0ghais3u+u93U1phs7mJ6y3y1RMEU0KcBWXYd8zfMQa4WTf2083IA1AILGNmoW
         xqIyWHSH5ejpmDKs0h6LZ/EiUh+JamX8sKXmuLq+GwklyIxfPJTnuyD/UKNWAfD47ujM
         sZu15CM6hVZuUeYbbmGciUkyk6Mxs3dQml8ViJu2MCz9kJ39ysL+5A8Sk4vAY42r/9j2
         WHJWZBx0foNRNlHJ5rAS5mQyeaqsAcstYUcoO5/f0iPIEoN9SvNwxq3rqJxNavRzJJaB
         /CCeS2CLoAIquzuNGcj9t2wiLzQd90GAOGNlxXRi13Np0+sgt0qlKjy4oyf4DdCxVZ8z
         FJSQ==
X-Received: by 10.60.97.137 with SMTP id ea9mr21640733oeb.12.1412529582735;
 Sun, 05 Oct 2014 10:19:42 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.76.105.196 with HTTP; Sun, 5 Oct 2014 10:19:02 -0700 (PDT)
From: Gabriel Scherer <gabriel.scherer@gmail.com>
Date: Sun, 5 Oct 2014 19:19:02 +0200
Message-ID: <CAPFanBEXe1rVifeL2fYH-hbcLXJMf3zWjqP3-K1Zhhh7L7p61Q@mail.gmail.com>
To: caml users <caml-list@inria.fr>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: [Caml-list] Feedback on -safe-string migration attempts

Hi list,

I recently converted Extlib to work with safe-string ( the patch can
be found in the ocaml-lib-devel archives,
http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
while it mostly went smoothly, there was a pain point that I think
would be worth discussing.

The question is, when converting an existing library interface, how to
decide whether any given part of the API should remain a "string" or
be moved to "bytes" (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
maybe provide two functions, one for each type.

# The problem

The new distinction between bytes and string, added in 4.02, actually
plays on two different intuitions:
- bytes represents (1) mutable (2) sequences of bytes
- string represents (1) immutable (2) end-user text (which happen to
be represented as sequence of bytes, but we could think of
representing them as eg. Javascript strings in the future and with
js_of_ocaml, or with ropes, etc.)

The problem is that aspects (1) and (2) are somewhat orthogonal. I
don't think we're interested in mutable end-user texts, but I
encountered a few notable cases of (1) immutable (2) sequences of
bytes. The problem is: should those be typed as string, or bytes?

(There may be a difference between functions that assume their
arguments are immutable, and function that simply guarantee that they
won't themselves mutate their arguments. For now I'll assume those two
cases count as "immutable sequences of bytes").

Right now, the standard library itself does a strange job of making a
choice. The Marshal module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
appears to favor the choice of "bytes" for non-mutated byte sequences
(eg. data_size, total_size), while the Digest module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
remained in the land of strings.


# An ideal solution

In an ideal world, I claim the best solution would be the following.
Given that it is clear (to me) that mutable byte sequences and
immutable byte sequences share the same representation, we should use
phantom type to distinguish them:

  type mut
  type immut
  type 'a bytes

  val get : 'a bytes -> int -> char
  val set : mut bytes -> int -> char
  Digest.t =3D immut bytes

Using phantom types had been considered at the time of the
bytes/string split, but rejected because suddenly adding polymorphism
to string literals and string functions broke a lot of code ("The type
of this expression, ..., contains a type variable that cannot be
generalized", or suddenly-polymorphic method return types). More
importantly, we do not want to enforce string and bytes to always have
the same underlying representation. Neither arguments hold for
mutable/immutable bytes.


# Going forward

It is probably a bit too late to change the "bytes" type in the
compiler standard library. (Well, feel free to disagree on this.)
And maybe we don't need to: just as more featureful, higher-level
libraries have been developed outside the OCaml distribution, we could
think of having a safer, higher-level phantom representation of byte
sequences, as an external library.

Regardless of what we do about this, I would recommend that immutable
byte sequences (things that are, by design, not text) be represented
as "bytes" rather than "string"=C2=B9. If/whenever a consensus on a safer
phantom representation appear, it will be possible to convert to it
without changing the representation.
Similarly, if your bytes-taking function does not mutate or capture
its input, you should mention it informally in its
specification/documentation (and maybe express this with a phantom
type later): this is important to reason about, for example, (un)safe
conversions on those byte sequences.

=C2=B9: a dissenting opinion could suggest that it is more important to get
the type-checker help re. mutability than expose the distinction
between byte-level data and text (which should be an abstract type in
some UTF8 library anyway), and thus immutable anything should rather
be "string". I think the phantom type approach is superior, and we
should design interfaces with it in mind.