Re: [Caml-list] Feedback on -safe-string migration attempts

caml-list - the Caml user's mailing list

From: Jacques Garrigue <garrigue@math.nagoya-u.ac.jp>
To: Gabriel Scherer <gabriel.scherer@gmail.com>
Cc: OCaML List Mailing <caml-list@inria.fr>
Subject: Re: [Caml-list] Feedback on -safe-string migration attempts
Date: Mon, 6 Oct 2014 11:11:14 +0900
Message-ID: <8BB0EB51-3516-4CB9-94EC-513BA87CD4FF@math.nagoya-u.ac.jp>
In-Reply-To: <CAPFanBEXe1rVifeL2fYH-hbcLXJMf3zWjqP3-K1Zhhh7L7p61Q@mail.gmail.com>

Hi Gabriel,

I think this is an interesting proposal.
We didn’t consider it when adding the -safe-string option, but I do
not think it is too late to change things in the compiler:
keeping compatibility with pre-4.02 code is essential, but Bytes
itself is still experimental, so changing the bytes type should be ok.

Actually, I think no decision was reached on whether the ability to have
different internal representations for string and bytes is really important.
This may matter when you use JavaScript as a backend, but otherwise?
If we decided to keep the same representation by default, then a
reasonable approach would be to adopt your proposal with the following
extra twist:

* in safe-string mode, string is an alias for immut bytes
	type 'a bytes
	type string = immut bytes
* in legacy mode, bytes is an alias for string
	type string
	type 'a bytes = string

To keep good compatibility, the functions in the String module would
only have monomorphic types.
The notation "s.[n]" is a subtle case, but a solution could be to have it
expand to the monomorphic String.get in legacy mode, and to the polymorphic
Bytes.get in safe-string mode. This would also let us keep the "s.[n] <- e"
notation.
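
As a rough illustration of the safe-string half of the proposal, here is a
minimal sketch written as an ordinary module on top of the existing Bytes.
The names PBytes, mut, immut, str and freeze are hypothetical and only stand
in for what the compiler would provide; the actual proposal would change the
built-in types rather than add a module.

```ocaml
(* Hypothetical sketch: PBytes, mut, immut, str and freeze are illustrative
   names, not part of the standard library. *)
module PBytes : sig
  type mut
  type immut
  type 'a t                        (* 'a is a phantom mutability tag *)
  type str = immut t               (* plays the role of "string" in safe-string mode *)

  val create : int -> char -> mut t
  val get : 'a t -> int -> char    (* polymorphic: accepts both tags *)
  val set : mut t -> int -> char -> unit
  val length : 'a t -> int
  val freeze : mut t -> str        (* copies, so later writes are not visible *)
  val of_string : string -> str
end = struct
  type mut
  type immut
  type 'a t = Bytes.t              (* same representation for both tags *)
  type str = immut t
  let create = Bytes.make
  let get = Bytes.get
  let set = Bytes.set
  let length = Bytes.length
  let freeze = Bytes.copy
  let of_string = Bytes.of_string
end

let () =
  let buf = PBytes.create 3 'a' in
  PBytes.set buf 1 'b';
  let s = PBytes.freeze buf in
  PBytes.set buf 2 'c';            (* does not affect the frozen copy *)
  assert (PBytes.get s 1 = 'b');
  assert (PBytes.get s 2 = 'a')
  (* PBytes.set s 0 'x' would be rejected by the type checker,
     because s has type immut t and set expects mut t. *)
```

In legacy mode the phantom parameter would simply be ignored, since
'a bytes would be an alias for string.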

I think this would be more comfortable to use than the current state.

	Jacques

On 2014/10/06 02:19, Gabriel Scherer wrote:
> 
> Hi list,
> 
> I recently converted Extlib to work with safe-string ( the patch can
> be found in the ocaml-lib-devel archives,
> http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
> while it mostly went smoothly, there was a pain point that I think
> would be worth discussing.
> 
> The question is, when converting an existing library interface, how to
> decide whether any given part of the API should remain a "string" or
> be moved to "bytes" (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
> maybe provide two functions, one for each type.
> 
> # The problem
> 
> The new distinction between bytes and string, added in 4.02, actually
> plays on two different intuitions:
> - bytes represents (1) mutable (2) sequences of bytes
> - string represents (1) immutable (2) end-user text (which happens to
> be represented as a sequence of bytes, but we could think of
> representing it as e.g. Javascript strings in the future and with
> js_of_ocaml, or with ropes, etc.)
> 
> The problem is that aspects (1) and (2) are somewhat orthogonal. I
> don't think we're interested in mutable end-user texts, but I
> encountered a few notable cases of (1) immutable (2) sequences of
> bytes. The problem is: should those be typed as string, or bytes?
> 
> (There may be a difference between functions that assume their
> arguments are immutable, and functions that simply guarantee that they
> won't themselves mutate their arguments. For now I'll assume those two
> cases count as "immutable sequences of bytes").
> 
> Right now, the standard library itself does a strange job of making a
> choice. The Marshal module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
> appears to favor the choice of "bytes" for non-mutated byte sequences
> (e.g. data_size, total_size), while the Digest module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
> remained in the land of strings.
> 
> 
> # An ideal solution
> 
> In an ideal world, I claim the best solution would be the following.
> Given that it is clear (to me) that mutable byte sequences and
> immutable byte sequences share the same representation, we should use
> a phantom type to distinguish them:
> 
>  type mut
>  type immut
>  type 'a bytes
> 
>  val get : 'a bytes -> int -> char
>  val set : mut bytes -> int -> char -> unit
>  Digest.t = immut bytes
> 
> Using phantom types had been considered at the time of the
> bytes/string split, but rejected because suddenly adding polymorphism
> to string literals and string functions broke a lot of code ("The type
> of this expression, ..., contains a type variable that cannot be
> generalized", or suddenly-polymorphic method return types). More
> importantly, we do not want to enforce string and bytes to always have
> the same underlying representation. Neither argument holds for
> mutable/immutable bytes.
> 
> 
> # Going forward
> 
> It is probably a bit too late to change the "bytes" type in the
> compiler standard library. (Well, feel free to disagree on this.)
> And maybe we don't need to: just as more featureful, higher-level
> libraries have been developed outside the OCaml distribution, we could
> think of having a safer, higher-level phantom representation of byte
> sequences, as an external library.
> 
> Regardless of what we do about this, I would recommend that immutable
> byte sequences (things that are, by design, not text) be represented
> as "bytes" rather than "string"¹. If/whenever a consensus on a safer
> phantom representation appears, it will be possible to convert to it
> without changing the representation.
> Similarly, if your bytes-taking function does not mutate or capture
> its input, you should mention it informally in its
> specification/documentation (and maybe express this with a phantom
> type later): this is important to reason about, for example, (un)safe
> conversions on those byte sequences.
> 
> ¹: a dissenting opinion could suggest that it is more important to get
> the type-checker's help regarding mutability than to expose the distinction
> between byte-level data and text (which should be an abstract type in
> some UTF8 library anyway), and thus immutable anything should rather
> be "string". I think the phantom type approach is superior, and we
> should design interfaces with it in mind.
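
The advice above about documenting non-mutating, non-capturing functions can
be illustrated with a small sketch. The helper names sum_bytes and sum_string
are hypothetical; Bytes.iter and Bytes.unsafe_of_string are the standard
library functions. The point is only that the informal contract is what
justifies the unsafe conversion.

```ocaml
(* Hypothetical helper: sum_bytes is an illustrative name.
   Informal contract: reads b only; it never mutates b and never
   keeps a reference to it after returning. *)
let sum_bytes (b : Bytes.t) : int =
  let acc = ref 0 in
  Bytes.iter (fun c -> acc := !acc + Char.code c) b;
  !acc

(* Because of the contract above, viewing a string as bytes without
   copying is safe for the duration of this call. *)
let sum_string (s : string) : int =
  sum_bytes (Bytes.unsafe_of_string s)

let () =
  assert (sum_string "abc" = 97 + 98 + 99)
```

With a phantom-typed interface, the same guarantee could presumably be
expressed in the type of sum_bytes instead of in a comment.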
