caml-list - the Caml user's mailing list
From: Gabriel Scherer <gabriel.scherer@gmail.com>
To: Jacques Garrigue <garrigue@math.nagoya-u.ac.jp>
Cc: OCaML List Mailing <caml-list@inria.fr>
Subject: Re: [Caml-list] Feedback on -safe-string migration attempts
Date: Mon, 6 Oct 2014 13:08:58 +0200
Message-ID: <CAPFanBGkJEGFkXzLh=fnhHPEFhX4m3je+Xg12g0Rp_YBUyGihg@mail.gmail.com>
In-Reply-To: <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp>

I'm strongly convinced that allowing a difference in representation is
the right choice. I'm not actively pushing for using another
representation, but in my experience thinking of them as distinct is
a very good thought-experiment to design better interfaces.

I would compare this to the idea that "data created at an immutable
type might be allocated in read-only memory": I'm not pushing for this
to be implemented, but it's an excellent thought-experiment to explain
why certain uses of Obj.magic are assuredly wrong.

(In my experience so far working with the stdlib, Extlib and
Batteries, I have not felt that the distinct representations were
painful. In particular, I needed Bytes.get rarely enough that the lack
of s.[n] syntax was never an issue.)


I would thus rather suggest the following interfaces:
  type 'a bytes (* = string  under -unsafe-string *)
  type string

An interesting interface change would be the conversion functions. We
currently have:

  Bytes.of_string : string -> bytes
  Bytes.to_string : bytes -> string

  Bytes.copy : bytes -> bytes

  (* see Bytes documentation *)
  Bytes.unsafe_of_string : string -> bytes
  Bytes.unsafe_to_string : bytes -> string

We could instead have something like the following:

  Bytes.immut_of_string : string -> immut bytes
  Bytes.immut_to_string : immut bytes -> string

  Bytes.copy : 'a bytes -> 'b bytes

  (* same usage restrictions as Bytes.unsafe_{of,to}_string,
     see the documentation *)
  Bytes.unsafe_mut : immut bytes -> mut bytes
  Bytes.unsafe_immut : mut bytes -> immut bytes

with the aliases

  Bytes.of_string : string -> 'a bytes
  of_string s = copy (immut_of_string s)

  Bytes.to_string : 'a bytes -> string
  to_string s = immut_to_string (copy s)

  Bytes.unsafe_to_string : mut bytes -> string
  unsafe_to_string s = immut_to_string (unsafe_immut s)

(unsafe_of_string could go away: it can only be correctly used if the
resulting bytes is used immutably, so it is superseded by the safe
immut_of_string function.)
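
To make the interface above concrete, here is a minimal sketch of how it
could be prototyped as an external module on top of the existing Bytes.
The module name Phantom_bytes and the extra create/length helpers are my
own assumptions for illustration; nothing here is part of the standard
library, and it assumes OCaml 4.02 or later.

```ocaml
(* Hypothetical prototype: Phantom_bytes is an illustrative name,
   not an existing stdlib or third-party module. *)
module Phantom_bytes : sig
  type mut
  type immut
  type 'a t                       (* plays the role of 'a bytes *)

  val create : int -> mut t       (* helper added for the example *)
  val length : 'a t -> int        (* helper added for the example *)
  val get : 'a t -> int -> char
  val set : mut t -> int -> char -> unit

  val immut_of_string : string -> immut t
  val immut_to_string : immut t -> string
  val copy : 'a t -> 'b t

  (* same usage restrictions as Bytes.unsafe_{of,to}_string *)
  val unsafe_mut : immut t -> mut t
  val unsafe_immut : mut t -> immut t

  val of_string : string -> 'a t
  val to_string : 'a t -> string
  val unsafe_to_string : mut t -> string
end = struct
  type mut
  type immut
  type 'a t = Bytes.t             (* phantom parameter, same representation *)

  let create = Bytes.create
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set

  (* No copy is made: this is only sound because an immut t cannot be
     mutated without going through unsafe_mut. *)
  let immut_of_string = Bytes.unsafe_of_string
  let immut_to_string = Bytes.unsafe_to_string
  let copy = Bytes.copy

  let unsafe_mut b = b
  let unsafe_immut b = b

  (* The aliases, written as in the proposal. *)
  let of_string s = copy (immut_of_string s)
  let to_string b = immut_to_string (copy b)
  let unsafe_to_string b = immut_to_string (unsafe_immut b)
end

let () =
  let open Phantom_bytes in
  let digest : immut t = immut_of_string "abc" in
  let buf : mut t = copy digest in
  set buf 0 'x';
  assert (to_string buf = "xbc");
  (* The immutable original is untouched by the write to its copy. *)
  assert (immut_to_string digest = "abc")
  (* [set digest 0 'x'] would be rejected by the type-checker. *)
```

The signature is what enforces the discipline: inside the module every
'a t is just Bytes.t, so the conversions cost nothing beyond the copies
that the safe functions make explicitly.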

On Mon, Oct 6, 2014 at 4:11 AM, Jacques Garrigue
<garrigue@math.nagoya-u.ac.jp> wrote:
> Hi Gabriel,
>
> I think this is an interesting proposal.
> We didn’t consider it when adding the -safe-string option, but I do
> not think it is too late to change things in the compiler:
> keeping compatibility with pre-4.02 code is essential, but Bytes
> itself is still experimental, so changing the bytes type should be ok.
>
> Actually, I think no decision was reached whether the ability to have
> different internal representations for string and bytes is really important.
> This may matter when you use javascript as backend, but otherwise?
> If we decided to keep the same representation by default, then a
> reasonable approach would be to adopt your proposal with the following
> extra twist:
>
> * in safe-string mode, string is an alias for immut bytes
>         type 'a bytes
>         type string = immut bytes
> * in legacy mode, bytes is an alias for string
>         type string
>         type 'a bytes = string
>
> To keep good compatibility, the functions in the String module would
> only have monomorphic types.
> The notation "s.[n]" is a subtle case, but a solution could be to have it
> expanded to the monomorphic String.get in legacy mode, and to the polymorphic
> Bytes.get in safe-string mode. This would allow keeping the "s.[n] <- e"
> notation too.
>
> I think this would be more comfortable to use than the current state.
>
>         Jacques
>
> On 2014/10/06 02:19, Gabriel Scherer wrote:
>>
>> Hi list,
>>
>> I recently converted Extlib to work with safe-string ( the patch can
>> be found in the ocaml-lib-devel archives,
>> http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
>> while it mostly went smoothly, there was a pain point that I think
>> would be worth discussing.
>>
>> The question is, when converting an existing library interface, how to
>> decide whether any given part of the API should remain a "string" or
>> be moved to "bytes" (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
>> maybe provide two functions, one for each type.
>>
>> # The problem
>>
>> The new distinction between bytes and string, added in 4.02, actually
>> plays on two different intuitions:
>> - bytes represents (1) mutable (2) sequences of bytes
>> - string represents (1) immutable (2) end-user text (which happens to
>> be represented as a sequence of bytes, but we could think of
>> representing them as eg. Javascript strings in the future and with
>> js_of_ocaml, or with ropes, etc.)
>>
>> The problem is that aspects (1) and (2) are somewhat orthogonal. I
>> don't think we're interested in mutable end-user texts, but I
>> encountered a few notable cases of (1) immutable (2) sequences of
>> bytes. The problem is: should those be typed as string, or bytes?
>>
>> (There may be a difference between functions that assume their
>> arguments are immutable, and functions that simply guarantee that they
>> won't themselves mutate their arguments. For now I'll assume those two
>> cases count as "immutable sequences of bytes").
>>
>> Right now, the standard library itself does a strange job of making a
>> choice. The Marshal module (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
>> appears to favor the choice of "bytes" for non-mutated byte sequences
>> (eg. data_size, total_size), while the Digest module (
>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
>> remained in the land of strings.
>>
>>
>> # An ideal solution
>>
>> In an ideal world, I claim the best solution would be the following.
>> Given that it is clear (to me) that mutable byte sequences and
>> immutable byte sequences share the same representation, we should use
>> phantom types to distinguish them:
>>
>>  type mut
>>  type immut
>>  type 'a bytes
>>
>>  val get : 'a bytes -> int -> char
>>  val set : mut bytes -> int -> char -> unit
>>  Digest.t = immut bytes
>>
>> Using phantom types had been considered at the time of the
>> bytes/string split, but rejected because suddenly adding polymorphism
>> to string literals and string functions broke a lot of code ("The type
>> of this expression, ..., contains a type variable that cannot be
>> generalized", or suddenly-polymorphic method return types). More
>> importantly, we do not want to enforce string and bytes to always have
>> the same underlying representation. Neither argument holds for
>> mutable/immutable bytes.
>>
>>
>> # Going forward
>>
>> It is probably a bit too late to change the "bytes" type in the
>> compiler standard library. (Well, feel free to disagree on this.)
>> And maybe we don't need to: just as more featureful, higher-level
>> libraries have been developed outside the OCaml distribution, we could
>> think of having a safer, higher-level phantom representation of byte
>> sequences, as an external library.
>>
>> Regardless of what we do about this, I would recommend that immutable
>> byte sequences (things that are, by design, not text) be represented
>> as "bytes" rather than "string"¹. If/whenever a consensus on a safer
>> phantom representation appears, it will be possible to convert to it
>> without changing the representation.
>> Similarly, if your bytes-taking function does not mutate or capture
>> its input, you should mention it informally in its
>> specification/documentation (and maybe express this with a phantom
>> type later): this is important to reason about, for example, (un)safe
>> conversions on those byte sequences.
>>
>> ¹: a dissenting opinion could suggest that it is more important to get
>> the type-checker's help re. mutability than to expose the distinction
>> between byte-level data and text (which should be an abstract type in
>> some UTF8 library anyway), and thus immutable anything should rather
>> be "string". I think the phantom type approach is superior, and we
>> should design interfaces with it in mind.
>
>
>
