From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 145CB7F057 for ; Mon, 6 Oct 2014 13:09:41 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of gabriel.scherer@gmail.com) identity=pra; client-ip=209.85.214.181; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="gabriel.scherer@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of gabriel.scherer@gmail.com designates 209.85.214.181 as permitted sender) identity=mailfrom; client-ip=209.85.214.181; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="gabriel.scherer@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-ob0-f181.google.com) identity=helo; client-ip=209.85.214.181; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="gabriel.scherer@gmail.com"; x-sender="postmaster@mail-ob0-f181.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArUBAG13MlTRVda1m2dsb2JhbABfDoNTWASCfskbhnlUAn0IFgERAQEBAQEGCwsJFCyEBAEBBBIRBBkBGx0BAwwGBQsNAgImAgIiAREBBQEcBhMbB4gHAQMRDZ86boswgXKDEIdBChknDWeGMRIBBQ6BHokdhRQ1MweCd4FUBZJ6gzaHDoEtO4YxiWiCDRgpgxsggRlCOy8BgkkBAQE X-IPAS-Result: ArUBAG13MlTRVda1m2dsb2JhbABfDoNTWASCfskbhnlUAn0IFgERAQEBAQEGCwsJFCyEBAEBBBIRBBkBGx0BAwwGBQsNAgImAgIiAREBBQEcBhMbB4gHAQMRDZ86boswgXKDEIdBChknDWeGMRIBBQ6BHokdhRQ1MweCd4FUBZJ6gzaHDoEtO4YxiWiCDRgpgxsggRlCOy8BgkkBAQE X-IronPort-AV: E=Sophos;i="5.04,663,1406584800"; d="scan'208";a="99584499" Received: from mail-ob0-f181.google.com ([209.85.214.181]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 06 Oct 2014 13:09:39 +0200 Received: by mail-ob0-f181.google.com with SMTP id wm4so3666142obc.40 for ; Mon, 06 Oct 2014 04:09:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; bh=TKaW0X/AGqQltr9z9KudpQTrJIA7ok9/E7p2XbiFftU=; b=OXiifziRJEjOxJm6vXYODl7bIoqb4ILVaV4xmk3TXz8suUlKYF93fDKtm+VLYyGIcE omJdqvsSzRy+HjkUEaJjYbMa96AcYfLgxX/XihlkhmKtfRX+C3o0Xd5cHdp0xxPUvZ1G PJ7FL+W8wDdk2L/XpuJ2rvJL63cY12WxSUcTtpKuud2t4NMxxggF2VZ3qRKl3NdzRxaE G9jwp3cSdL0UNbhEQ7tXTfwXkQNPhExGaOENP91bFvTCN7vRpfMyY6IWkH3np020rABB Fsub9zIyGIWVwsrtaWWEI4qh2PZDPOCApES4E+rB1rj4kq5cay9TXZEsTWCcJxzZ/5xe Z1QA== X-Received: by 10.60.45.7 with SMTP id i7mr26964121oem.2.1412593778887; Mon, 06 Oct 2014 04:09:38 -0700 (PDT) MIME-Version: 1.0 Received: by 10.76.105.196 with HTTP; Mon, 6 Oct 2014 04:08:58 -0700 (PDT) In-Reply-To: <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp> References: <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp> From: Gabriel Scherer Date: Mon, 6 Oct 2014 13:08:58 +0200 Message-ID: To: Jacques Garrigue Cc: OCaML List Mailing Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Caml-list] Feedback on -safe-string migration attempts I'm strongly convinced that allowing a difference in representation is the right choice. I'm not actively pushing for using another representation, but in my experience thinking of them as distinct is a very good thought-experiment to design better interfaces. I would compare this to the idea that "data created at an immutable type might be allocated in read-only memory": I'm not pushing for this to be implemented, but it's an excellent thought-experiment to explain why certains use of Obj.magic are assuredly wrong. (In my experience so far working with the stdlib, Extlib and Batteries, I have not felt that the distinct representations were painful. In particular, I needed Bytes.get rarely enough that the lack of s.[n] syntax was never an issue.) I would thus rather suggest the following interfaces: type 'a bytes (* =3D string under -unsafe-string *) type string An interesting interface change would be the conversion functions. We currently have: Bytes.of_string : string -> bytes Bytes.to_string : bytes -> string Bytes.copy : bytes -> bytes (* see Bytes documentation *) Bytes.unsafe_of_string : string -> bytes Bytes.unsafe_to_string : bytes -> string We could instead have something like the following: Bytes.immut_of_string : string -> immut bytes Bytes.immut_to_string : immut bytes -> string Bytes.copy : 'a bytes -> 'b bytes (* same usage restrinction than Bytes.unsafe_{of,to}_string, see the documentation *) Bytes.unsafe_mut : immut bytes -> mut bytes Bytes.unsafe_immut : bytes bytes -> immut bytes with the aliases Bytes.of_string : string -> 'a bytes of_string s =3D copy (immut_of_string s) Bytes.to_string : 'a bytes -> string to_string s =3D immut_to_string (copy s) Bytes.unsafe_to_string : mut bytes -> string unsafe_to_string s =3D immut_to_string (unsafe_immut s) (unsafe_of_string could go away: it can only be correctly used if the resulting bytes is used immutably, so it is superseded by the safe immut_of_string function.) On Mon, Oct 6, 2014 at 4:11 AM, Jacques Garrigue wrote: > Hi Gabriel, > > I think this is an interesting proposal. > We didn=E2=80=99t consider it when adding the -safe-string option, but I = do > not think it is too late to change things in the compiler: > keeping compatibility with pre-4.02 code is essential, but Bytes > itself is still experimental, so changing the bytes type should be ok. > > Actually, I think no decision was reached whether the ability to have > different internal representations for string and bytes is really importa= nt. > This may matter when you use javascript as backend, but otherwise? > If we decided to keep the same representation by default, then a > reasonable approach would be to adopt your proposal with the following > extra twist: > > * in safe-string mode, string is an alias for immut bytes > type 'a bytes > type string =3D immut bytes > * in legacy mode, bytes is an alias for string > type string > type 'a bytes =3D string > > To keep good compatibility, the functions in the String module would > only have monomorphic types. > The notation "s.[n]" is a subtle case, but a solution could be to have it > expanded to the monomorphic String.get in legacy mode, and to the polymor= phic > Bytes.get in safe-string mode. This would allow to keep the "s.[n] <- e= =E2=80=9D > notation too. > > I think this would be more comfortable to use than the current state. > > Jacques > > On 2014/10/06 02:19, Gabriel Scherer wrote: >> >> Hi list, >> >> I recently converted Extlib to work with safe-string ( the patch can >> be found in the ocaml-lib-devel archives, >> http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and >> while it mostly went smoothly, there was a pain point that I think >> would be worth discussing. >> >> The question is, when converting an existing library interface, how to >> decide whether any given part of the API should remain a "string" or >> be moved to "bytes" ( >> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or >> maybe provide two functions, one for each type. >> >> # The problem >> >> The new distinction between bytes and string, added in 4.02, actually >> plays on two different intuitions: >> - bytes represents (1) mutable (2) sequences of bytes >> - string represents (1) immutable (2) end-user text (which happen to >> be represented as sequence of bytes, but we could think of >> representing them as eg. Javascript strings in the future and with >> js_of_ocaml, or with ropes, etc.) >> >> The problem is that aspects (1) and (2) are somewhat orthogonal. I >> don't think we're interested in mutable end-user texts, but I >> encountered a few notable cases of (1) immutable (2) sequences of >> bytes. The problem is: should those be typed as string, or bytes? >> >> (There may be a difference between functions that assume their >> arguments are immutable, and function that simply guarantee that they >> won't themselves mutate their arguments. For now I'll assume those two >> cases count as "immutable sequences of bytes"). >> >> Right now, the standard library itself does a strange job of making a >> choice. The Marshal module ( >> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html ) >> appears to favor the choice of "bytes" for non-mutated byte sequences >> (eg. data_size, total_size), while the Digest module ( >> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html ) >> remained in the land of strings. >> >> >> # An ideal solution >> >> In an ideal world, I claim the best solution would be the following. >> Given that it is clear (to me) that mutable byte sequences and >> immutable byte sequences share the same representation, we should use >> phantom type to distinguish them: >> >> type mut >> type immut >> type 'a bytes >> >> val get : 'a bytes -> int -> char >> val set : mut bytes -> int -> char >> Digest.t =3D immut bytes >> >> Using phantom types had been considered at the time of the >> bytes/string split, but rejected because suddenly adding polymorphism >> to string literals and string functions broke a lot of code ("The type >> of this expression, ..., contains a type variable that cannot be >> generalized", or suddenly-polymorphic method return types). More >> importantly, we do not want to enforce string and bytes to always have >> the same underlying representation. Neither arguments hold for >> mutable/immutable bytes. >> >> >> # Going forward >> >> It is probably a bit too late to change the "bytes" type in the >> compiler standard library. (Well, feel free to disagree on this.) >> And maybe we don't need to: just as more featureful, higher-level >> libraries have been developed outside the OCaml distribution, we could >> think of having a safer, higher-level phantom representation of byte >> sequences, as an external library. >> >> Regardless of what we do about this, I would recommend that immutable >> byte sequences (things that are, by design, not text) be represented >> as "bytes" rather than "string"=C2=B9. If/whenever a consensus on a safer >> phantom representation appear, it will be possible to convert to it >> without changing the representation. >> Similarly, if your bytes-taking function does not mutate or capture >> its input, you should mention it informally in its >> specification/documentation (and maybe express this with a phantom >> type later): this is important to reason about, for example, (un)safe >> conversions on those byte sequences. >> >> =C2=B9: a dissenting opinion could suggest that it is more important to = get >> the type-checker help re. mutability than expose the distinction >> between byte-level data and text (which should be an abstract type in >> some UTF8 library anyway), and thus immutable anything should rather >> be "string". I think the phantom type approach is superior, and we >> should design interfaces with it in mind. > > >