From: Gabriel Scherer
Date: Sun, 5 Oct 2014 19:19:02 +0200
To: caml users
Subject: [Caml-list] Feedback on -safe-string migration attempts

Hi list,

I recently converted Extlib to work with safe-string (the patch can be found in the ocaml-lib-devel archives, http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and while it mostly went smoothly, there was a pain point that I think would be worth discussing.

The question is, when converting an existing library interface, how to decide whether any given part of the API should remain a "string" or be moved to "bytes" ( http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ), or maybe provide two functions, one for each type.

# The problem

The new distinction between bytes and string, added in 4.02, actually plays on two different intuitions:

- bytes represents (1) mutable (2) sequences of bytes
- string represents (1) immutable (2) end-user text (which happens to be represented as a sequence of bytes, but which we could imagine representing as, e.g., Javascript strings in the future with js_of_ocaml, or as ropes, etc.)

The problem is that aspects (1) and (2) are somewhat orthogonal. I don't think we're interested in mutable end-user text, but I encountered a few notable cases of (1) immutable (2) sequences of bytes. Should those be typed as string, or as bytes?

(There may be a difference between functions that assume their arguments are immutable and functions that simply guarantee that they won't themselves mutate their arguments. For now I'll assume both cases count as "immutable sequences of bytes".)

Right now, the standard library itself is inconsistent in its choice. The Marshal module ( http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html ) appears to favor "bytes" for non-mutated byte sequences (e.g. data_size, total_size), while the Digest module ( http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html ) remained in the land of strings.

# An ideal solution

In an ideal world, I claim the best solution would be the following.
Given that it is clear (to me) that mutable byte sequences and immutable byte sequences share the same representation, we should use phantom types to distinguish them:

    type mut
    type immut
    type 'a bytes
    val get : 'a bytes -> int -> char
    val set : mut bytes -> int -> char -> unit

    Digest.t = immut bytes

Using phantom types had been considered at the time of the bytes/string split, but was rejected because suddenly adding polymorphism to string literals and string functions broke a lot of code ("The type of this expression, ..., contains a type variable that cannot be generalized", or suddenly-polymorphic method return types). More importantly, we do not want to force string and bytes to always have the same underlying representation. Neither argument holds for mutable/immutable bytes.

# Going forward

It is probably a bit too late to change the "bytes" type in the compiler standard library. (Well, feel free to disagree on this.) And maybe we don't need to: just as more featureful, higher-level libraries have been developed outside the OCaml distribution, we could think of having a safer, higher-level phantom representation of byte sequences as an external library.

Regardless of what we do about this, I would recommend that immutable byte sequences (things that are, by design, not text) be represented as "bytes" rather than "string"¹. If and when a consensus on a safer phantom representation emerges, it will be possible to convert to it without changing the representation.

Similarly, if your bytes-taking function does not mutate or capture its input, you should mention it informally in its specification/documentation (and maybe express this with a phantom type later): this is important for reasoning about, for example, (un)safe conversions on those byte sequences.

¹: a dissenting opinion could suggest that it is more important to get the type-checker's help regarding mutability than to expose the distinction between byte-level data and text (which should be an abstract type in some UTF8 library anyway), and thus that immutable anything should rather be "string". I think the phantom type approach is superior, and we should design interfaces with it in mind.
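
To make the phantom-type proposal from "An ideal solution" more concrete, here is a minimal sketch of what such an external library might look like on top of the 4.02 Bytes module. Everything named here is an assumption for illustration only: the module name Pbytes and the functions create, of_string, freeze and to_string are hypothetical and do not exist in the standard library or in Extlib.

```ocaml
(* Hypothetical module: a phantom-typed view over Bytes.t.
   The names Pbytes, create, of_string, freeze and to_string are
   illustrative assumptions, not an existing API. *)
module Pbytes : sig
  type mut    (* phantom tag: the holder may mutate the sequence *)
  type immut  (* phantom tag: read-only byte sequence *)
  type 'a t   (* same runtime representation whatever the tag *)

  val create : int -> char -> mut t
  val of_string : string -> immut t
  val length : 'a t -> int
  val get : 'a t -> int -> char
  val set : mut t -> int -> char -> unit
  val freeze : mut t -> immut t   (* copies, so later writes are not visible *)
  val to_string : 'a t -> string
end = struct
  type mut
  type immut
  type 'a t = Bytes.t             (* the tag is erased at runtime *)

  let create = Bytes.make
  let of_string = Bytes.of_string (* Bytes.of_string already copies *)
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
  let freeze = Bytes.copy
  let to_string = Bytes.to_string
end

let () =
  let buf = Pbytes.create 3 'a' in
  Pbytes.set buf 0 'b';
  let frozen = Pbytes.freeze buf in
  Pbytes.set buf 1 'c';
  (* The frozen copy does not observe the later write to buf. *)
  assert (Pbytes.to_string buf = "bca");
  assert (Pbytes.to_string frozen = "baa")
  (* Pbytes.set frozen 0 'x' would be rejected by the type checker,
     because frozen has type Pbytes.immut Pbytes.t. *)
```

Because 'a t is abstract outside the module, mut t and immut t are distinct types for client code, while read-only functions such as get and length stay polymorphic in the tag and accept both. In this sketch freeze copies its argument; a non-copying variant would only be safe when the caller gives up the mutable handle, which is exactly the kind of informal guarantee the message suggests documenting.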