On Sunday, 05.10.2014, at 19:19 +0200, Gabriel Scherer wrote:
> The question is, when converting an existing library interface, how to
> decide whether any given part of the API should remain a "string" or
> be moved to "bytes" (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
> maybe provide two functions, one for each type.
> 
> # The problem
> 
> The new distinction between bytes and string, added in 4.02, actually
> plays on two different intuitions:
> - bytes represents (1) mutable (2) sequences of bytes
> - string represents (1) immutable (2) end-user text (which happens to
> be represented as a sequence of bytes, but we could think of
> representing it as e.g. Javascript strings in the future and with
> js_of_ocaml, or with ropes, etc.)

Well, I think there are different views on this. The OCaml stdlib makes
no distinction between character and byte, and leaves it to the user how
to represent text (e.g. as multibyte UTF-8). From that point of view it
is clear that "string" is for immutable data, no matter whether text or
bytes, and "bytes" is for mutable data of either kind. However, as you
found out, this is sometimes impractical: for a given kind of data you
don't want to draw a somewhat arbitrary line between its mutable and
immutable appearances.

I will also have to convert a fair amount of code, and especially for
Ocamlnet I really don't see how to do this cleanly, because the kind of
data suddenly changes. For instance, the HTTP client reads bytes into a
bytes buffer, but the HTTP headers have more the characteristics of
text.
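
To illustrate where the kind of data changes, here is a minimal sketch
(not Ocamlnet code; the function name first_line and the buffer handling
are made up for this example). The network side fills a mutable bytes
buffer, and the header line leaves it as an immutable string via
Bytes.sub_string, which copies:

```ocaml
(* Hypothetical helper: [buf] holds [len] valid bytes received so far.
   Returns the first CRLF-terminated line as a string, if complete. *)
let first_line (buf : bytes) (len : int) : string option =
  let rec find i =
    if i + 1 >= len then None
    else if Bytes.get buf i = '\r' && Bytes.get buf (i + 1) = '\n'
    then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some eol ->
    (* The copy is the point where "bytes" data becomes "string" text. *)
    Some (Bytes.sub_string buf 0 eol)
```

The copy is safe but not free, which is part of why the line between the
two types feels arbitrary in this kind of code.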

> The problem is that aspects (1) and (2) are somewhat orthogonal. I
> don't think we're interested in mutable end-user texts, but I
> encountered a few notable cases of (1) immutable (2) sequences of
> bytes. The problem is: should those be typed as string, or bytes?
> 
> (There may be a difference between functions that assume their
> arguments are immutable, and functions that simply guarantee that they
> won't themselves mutate their arguments. For now I'll assume those two
> cases count as "immutable sequences of bytes").
> 
> Right now, the standard library itself does a strange job of making a
> choice. The Marshal module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
> appears to favor the choice of "bytes" for non-mutated byte sequences
> (eg. data_size, total_size), while the Digest module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
> remained in the land of strings.

I also noticed that some functionality is now available over two
interfaces. In particular, there are now functions for writing from a
bytes buffer and for writing from a string buffer (e.g. Unix.write and
Unix.write_substring). My thinking is that all stdlib functions should
now be provided in that manner.
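
As a sketch of that pattern for a library function, assuming a Buffer.t
as the sink so that the example needs no Unix library (the names
write_bytes and write_substring are made up here; they only mirror the
argument order of Unix.write and Unix.write_substring):

```ocaml
(* Bytes variant: appends [len] bytes of [buf] starting at [pos]
   and returns the number of bytes written. *)
let write_bytes (out : Buffer.t) (buf : bytes) (pos : int) (len : int) : int =
  Buffer.add_subbytes out buf pos len;
  len

(* String variant with the same shape, so callers holding immutable
   data do not need to convert first. *)
let write_substring (out : Buffer.t) (s : string) (pos : int) (len : int) : int =
  Buffer.add_substring out s pos len;
  len
```

Neither function mutates or keeps its input, so offering both is purely
a convenience for the caller.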

> # An ideal solution
> 
> In an ideal world, I claim the best solution would be the following.
> Given that it is clear (to me) that mutable byte sequences and
> immutable byte sequences share the same representation, we should use
> a phantom type to distinguish them:
> 
>   type mut
>   type immut
>   type 'a bytes
> 
>   val get : 'a bytes -> int -> char
>   val set : mut bytes -> int -> char -> unit
>   Digest.t = immut bytes
> 
> Using phantom types had been considered at the time of the
> bytes/string split, but rejected because suddenly adding polymorphism
> to string literals and string functions broke a lot of code ("The type
> of this expression, ..., contains a type variable that cannot be
> generalized", or suddenly-polymorphic method return types). More
> importantly, we do not want to enforce string and bytes to always have
> the same underlying representation. Neither argument holds for
> mutable/immutable bytes.

A new variant! The problem is still that it introduces polymorphism.
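
To make the quoted proposal concrete, here is a minimal sketch of how
such a phantom-typed module could look as an external library. The
module name Pbytes and the functions create, of_string, freeze and
to_string are assumptions for illustration; only mut, immut, get and set
come from the proposal:

```ocaml
module Pbytes : sig
  type mut
  type immut
  type 'a t                                  (* 'a is the phantom parameter *)
  val create : int -> mut t
  val of_string : string -> immut t
  val get : 'a t -> int -> char              (* reading works for both kinds *)
  val set : mut t -> int -> char -> unit     (* writing needs a mut value *)
  val freeze : mut t -> immut t              (* copies, so later writes are not seen *)
  val to_string : 'a t -> string
end = struct
  type mut
  type immut
  type 'a t = Bytes.t                        (* one representation for both *)
  let create n = Bytes.make n '\000'
  let of_string = Bytes.of_string
  let get = Bytes.get
  let set = Bytes.set
  let freeze = Bytes.copy
  let to_string = Bytes.to_string
end
```

With this signature, calling Pbytes.set on a value of type immut t is
rejected by the type checker, while every function that only reads has
to carry the type parameter 'a in its type.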

For OCamlnet I was thinking more of also providing all internal string
functionality through a string_reader interface. A string_reader is a
small abstraction on top of string/bytes/char bigarray that hides the
representation and provides the most needed functions (at least what
String/Bytes have, plus extensions like searching, conversions, maybe
even regexps). I think that's the missing piece to make the "bytes"
change acceptable:

module String_reader : sig
  type t
  val for_string : string -> t
  val for_bytes : bytes -> t
  val for_memory : (char,...,...) Bigarray.Array1.t -> t

  val get : t -> int -> char
  val sub_string : t -> int -> int -> string
  val sub_bytes : t -> int -> int -> bytes
  val blit_to_bytes : t -> int -> bytes -> int -> int -> unit
  val blit_to_memory :
    t -> int -> (char,...) Bigarray.Array1.t -> int -> int -> unit

  val index : t -> char -> int

  val search_leftmost : t -> t -> int -> int
  val search_rightmost : t -> t -> int -> int

  val get_int32_le : t -> int -> int32
  val get_int32_be : t -> int -> int32
  val get_int64_le : t -> int -> int64
  val get_int64_be : t -> int -> int64

  (* plus more ... not sure yet what to cover exactly *)

end
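
A minimal sketch of how a subset of this could be implemented, covering
only the string and bytes cases and leaving out the bigarray one. The
variant representation, the extra length function, and the meaning of
search_leftmost (first match at or after the given start position,
Not_found otherwise) are assumptions for this example, not a settled
design:

```ocaml
module String_reader : sig
  type t
  val for_string : string -> t
  val for_bytes : bytes -> t
  val length : t -> int                  (* assumption: not in the list above *)
  val get : t -> int -> char
  val sub_string : t -> int -> int -> string
  val index : t -> char -> int
  val search_leftmost : t -> t -> int -> int
  val get_int32_le : t -> int -> int32
end = struct
  type t = S of string | B of bytes

  let for_string s = S s
  let for_bytes b = B b

  let length = function
    | S s -> String.length s
    | B b -> Bytes.length b

  let get r i = match r with
    | S s -> String.get s i
    | B b -> Bytes.get b i

  let sub_string r pos len = match r with
    | S s -> String.sub s pos len
    | B b -> Bytes.sub_string b pos len

  let index r c = match r with
    | S s -> String.index s c
    | B b -> Bytes.index b c

  (* Naive search: first position >= start where needle occurs in hay. *)
  let search_leftmost hay needle start =
    let n = length hay and m = length needle in
    let rec matches_at i j =
      j = m || (get hay (i + j) = get needle j && matches_at i (j + 1)) in
    let rec scan i =
      if i + m > n then raise Not_found
      else if matches_at i 0 then i
      else scan (i + 1) in
    if start < 0 then invalid_arg "String_reader.search_leftmost";
    scan start

  (* Little-endian: the byte at pos is the least significant one. *)
  let get_int32_le r pos =
    let byte k = Int32.of_int (Char.code (get r (pos + k))) in
    Int32.logor (byte 0)
      (Int32.logor (Int32.shift_left (byte 1) 8)
         (Int32.logor (Int32.shift_left (byte 2) 16)
            (Int32.shift_left (byte 3) 24)))
end
```

A header parser written against String_reader.t would then work
unchanged whether the caller holds a string or the bytes buffer the data
was read into.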


Gerd


> 
> 
> # Going forward
> 
> It is probably a bit too late to change the "bytes" type in the
> compiler standard library. (Well, feel free to disagree on this.)
> And maybe we don't need to: just as more featureful, higher-level
> libraries have been developed outside the OCaml distribution, we could
> think of having a safer, higher-level phantom representation of byte
> sequences, as an external library.
> 
> Regardless of what we do about this, I would recommend that immutable
> byte sequences (things that are, by design, not text) be represented
> as "bytes" rather than "string"¹. If/whenever a consensus on a safer
> phantom representation appears, it will be possible to convert to it
> without changing the representation.
> Similarly, if your bytes-taking function does not mutate or capture
> its input, you should mention it informally in its
> specification/documentation (and maybe express this with a phantom
> type later): this is important to reason about, for example, (un)safe
> conversions on those byte sequences.
> 
> ¹: a dissenting opinion could suggest that it is more important to get
> the type-checker help re. mutability than expose the distinction
> between byte-level data and text (which should be an abstract type in
> some UTF8 library anyway), and thus immutable anything should rather
> be "string". I think the phantom type approach is superior, and we
> should design interfaces with it in mind.
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------