caml-list - the Caml user's mailing list
* [Caml-list] Feedback on -safe-string migration attempts
@ 2014-10-05 17:19 Gabriel Scherer
From: Gabriel Scherer @ 2014-10-05 17:19 UTC
  To: caml users

Hi list,

I recently converted Extlib to work with safe-string ( the patch can
be found in the ocaml-lib-devel archives,
http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
while it mostly went smoothly, there was a pain point that I think
would be worth discussing.

The question is, when converting an existing library interface, how to
decide whether any given part of the API should remain a "string" or
be moved to "bytes" (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
maybe provide two functions, one for each type.

# The problem

The new distinction between bytes and string, added in 4.02, actually
plays on two different intuitions:
- bytes represents (1) mutable (2) sequences of bytes
- string represents (1) immutable (2) end-user text (which happens to
be represented as a sequence of bytes, but we could think of
representing it as e.g. Javascript strings in the future and with
js_of_ocaml, or with ropes, etc.)
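
As a small illustration of the split (a sketch using only the standard
Bytes module from 4.02 onwards; the variable names are made up for the
example), conversions between the two types copy, so a string obtained
from a buffer is unaffected by later mutation of that buffer:

```ocaml
let () =
  (* Bytes.of_string copies the text into a fresh mutable buffer. *)
  let buf = Bytes.of_string "hello" in
  Bytes.set buf 0 'H';
  (* Bytes.to_string copies again, producing an immutable string. *)
  let text = Bytes.to_string buf in
  (* Mutating the buffer afterwards does not affect [text]. *)
  Bytes.set buf 1 'E';
  assert (text = "Hello");
  assert (Bytes.to_string buf = "HEllo")
```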

The problem is that aspects (1) and (2) are somewhat orthogonal. I
don't think we're interested in mutable end-user texts, but I
encountered a few notable cases of (1) immutable (2) sequences of
bytes. The problem is: should those be typed as string, or bytes?

(There may be a difference between functions that assume their
arguments are immutable, and functions that simply guarantee that they
won't themselves mutate their arguments. For now I'll assume those two
cases count as "immutable sequences of bytes").
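
To make the parenthetical concrete, here is a hedged sketch of the two
cases. Both functions (count_byte and make_lookup) are hypothetical
names invented for this example, not taken from Extlib or the standard
library. The first only promises not to mutate its argument; the second
captures it, and therefore relies on the caller never mutating it
afterwards:

```ocaml
(* Only reads [b]: never mutates it and keeps no reference to it. *)
let count_byte (b : Bytes.t) (c : char) : int =
  let n = ref 0 in
  Bytes.iter (fun x -> if x = c then incr n) b;
  !n

(* Captures [b]: the returned closure gives stable answers only if
   the caller never mutates [b] afterwards. *)
let make_lookup (b : Bytes.t) : int -> char =
  fun i -> Bytes.get b i

let () =
  let buf = Bytes.of_string "abca" in
  let lookup = make_lookup buf in
  assert (count_byte buf 'a' = 2);
  assert (lookup 0 = 'a');
  Bytes.set buf 0 'z';
  (* The captured buffer changed under the closure. *)
  assert (lookup 0 = 'z');
  assert (count_byte buf 'a' = 1)
```

Both have the same type-level treatment of their argument today (plain
bytes), which is why the distinction has to live in documentation.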

Right now, the standard library itself is inconsistent in making this
choice. The Marshal module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
appears to favor the choice of "bytes" for non-mutated byte sequences
(e.g. data_size, total_size), while the Digest module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
remained in the land of strings.
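
The contrast can be seen side by side in a short sketch. It uses only
standard library functions (Marshal.to_bytes, Marshal.data_size,
Marshal.total_size, Digest.string, Digest.to_hex); the sample values
are arbitrary:

```ocaml
let () =
  (* Marshal: the serialized value is bytes, and the size functions
     take bytes even though they only read the header. *)
  let payload = Marshal.to_bytes [1; 2; 3] [] in
  let data = Marshal.data_size payload 0 in
  assert (Marshal.total_size payload 0 = Marshal.header_size + data);

  (* Digest: input and result are both strings, although a digest is
     binary data rather than text. *)
  let d : string = Digest.string "hello" in
  assert (String.length d = 16);
  assert (String.length (Digest.to_hex d) = 32)
```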


# An ideal solution

In an ideal world, I claim the best solution would be the following.
Given that it is clear (to me) that mutable byte sequences and
immutable byte sequences share the same representation, we should use
a phantom type to distinguish them:

  type mut
  type immut
  type 'a bytes

  val get : 'a bytes -> int -> char
  val set : mut bytes -> int -> char -> unit
  Digest.t = immut bytes

Using phantom types had been considered at the time of the
bytes/string split, but rejected because suddenly adding polymorphism
to string literals and string functions broke a lot of code ("The type
of this expression, ..., contains a type variable that cannot be
generalized", or suddenly-polymorphic method return types). More
importantly, we do not want to enforce string and bytes to always have
the same underlying representation. Neither argument holds for
mutable/immutable bytes.
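
For concreteness, the phantom-type idea could be packaged as an
external module along the following lines. This is only a sketch under
stated assumptions: the module name PBytes and the functions create,
of_string and freeze are hypothetical and do not exist in any released
library; only the underlying Bytes calls are real.

```ocaml
module PBytes : sig
  type mut
  type immut
  type 'a t                      (* phantom parameter: mut or immut *)

  val create : int -> char -> mut t
  val of_string : string -> immut t
  val length : 'a t -> int
  val get : 'a t -> int -> char
  val set : mut t -> int -> char -> unit
  val freeze : mut t -> immut t  (* copies, so the result is never aliased *)
end = struct
  type mut
  type immut
  type 'a t = Bytes.t            (* same representation in both cases *)

  let create = Bytes.make
  let of_string = Bytes.of_string
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
  let freeze = Bytes.copy
end

let () =
  let b = PBytes.create 3 'a' in
  PBytes.set b 0 'z';
  let frozen = PBytes.freeze b in
  PBytes.set b 1 'y';
  (* [frozen] is a snapshot taken before the second write. *)
  assert (PBytes.get frozen 0 = 'z');
  assert (PBytes.get frozen 1 = 'a')
  (* [PBytes.set frozen 0 'x'] would be rejected by the type checker. *)
```

Reading functions such as get and length stay polymorphic in the
phantom parameter, so they accept both flavours without conversion.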


# Going forward

It is probably a bit too late to change the "bytes" type in the
compiler standard library. (Well, feel free to disagree on this.)
And maybe we don't need to: just as more featureful, higher-level
libraries have been developed outside the OCaml distribution, we could
think of having a safer, higher-level phantom representation of byte
sequences, as an external library.

Regardless of what we do about this, I would recommend that immutable
byte sequences (things that are, by design, not text) be represented
as "bytes" rather than "string"¹. If/whenever a consensus on a safer
phantom representation appears, it will be possible to convert to it
without changing the representation.
Similarly, if your bytes-taking function does not mutate or capture
its input, you should mention it informally in its
specification/documentation (and maybe express this with a phantom
type later): this is important to reason about, for example, (un)safe
conversions on those byte sequences.
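
A hedged sketch of why that documentation matters: sum_bytes and
sum_string below are hypothetical names made up for this example, while
Bytes.unsafe_of_string is the real standard library function. The
zero-copy conversion is only sound because of the documented contract
of the callee:

```ocaml
(* [sum_bytes b] only reads [b]: it does not mutate [b] and keeps no
   reference to it after returning. *)
let sum_bytes (b : Bytes.t) : int =
  let total = ref 0 in
  Bytes.iter (fun c -> total := !total + Char.code c) b;
  !total

(* A caller holding an immutable string can skip the copy.
   [Bytes.unsafe_of_string] is acceptable here only because
   [sum_bytes] neither mutates nor captures its argument. *)
let sum_string (s : string) : int =
  sum_bytes (Bytes.unsafe_of_string s)

let () =
  assert (sum_string "abc" = 97 + 98 + 99)
```

If sum_bytes ever started mutating or storing its argument, sum_string
would silently break the immutability of strings, and nothing in the
types would catch it.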

¹: a dissenting opinion could suggest that it is more important to get
the type-checker's help regarding mutability than to expose the distinction
between byte-level data and text (which should be an abstract type in
some UTF8 library anyway), and thus immutable anything should rather
be "string". I think the phantom type approach is superior, and we
should design interfaces with it in mind.
