From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <alain@frisch.fr>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104])
	by sympa.inria.fr (Postfix) with ESMTPS id 7F0277FA13
	for <caml-list@sympa.inria.fr>; Mon,  7 Jul 2014 14:42:23 +0200 (CEST)
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  alain@frisch.fr) identity=pra; client-ip=193.252.23.210;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="alain@frisch.fr"; x-sender="alain@frisch.fr";
  x-conformance=sidf_compatible
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  alain@frisch.fr) identity=mailfrom; client-ip=193.252.23.210;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="alain@frisch.fr"; x-sender="alain@frisch.fr";
  x-conformance=sidf_compatible
Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender
  authenticity information available from domain of
  postmaster@msa.smtpout.orange.fr) identity=helo;
  client-ip=193.252.23.210;
  receiver=mail3-smtp-sop.national.inria.fr;
  envelope-from="alain@frisch.fr";
  x-sender="postmaster@msa.smtpout.orange.fr";
  x-conformance=sidf_compatible
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AiABAK6UulPB/BfSnGdsb2JhbABRCYNgxxMBgSoPAQEBAQEICwkJFCiEAwEBBScRLRMRCw4KCRYPCQMCAQIBRQYBDAgBARAHiCsJyUoXjQSBQyg6hEMFmnaBSIVHkEJq
X-IPAS-Result: AiABAK6UulPB/BfSnGdsb2JhbABRCYNgxxMBgSoPAQEBAQEICwkJFCiEAwEBBScRLRMRCw4KCRYPCQMCAQIBRQYBDAgBARAHiCsJyUoXjQSBQyg6hEMFmnaBSIVHkEJq
X-IronPort-AV: E=Sophos;i="5.01,618,1400018400"; 
   d="scan'208";a="70410214"
Received: from msa01.smtpout.orange.fr (HELO msa.smtpout.orange.fr) ([193.252.23.210])
  by mail3-smtp-sop.national.inria.fr with ESMTP; 07 Jul 2014 14:42:22 +0200
Received: from [192.168.1.133] ([92.151.60.153])
	by mwinf5d54 with ME
	id PQiM1o0053JNEic03QiMuY; Mon, 07 Jul 2014 14:42:21 +0200
X-ME-Helo: [192.168.1.133]
X-ME-Auth: bGV4aWZpQHdhbmFkb28uZnI=
X-ME-Date: Mon, 07 Jul 2014 14:42:21 +0200
X-ME-IP: 92.151.60.153
Message-ID: <53BA95AC.3050602@frisch.fr>
Date: Mon, 07 Jul 2014 14:42:20 +0200
From: Alain Frisch <alain@frisch.fr>
User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Gerd Stolpmann <info@gerd-stolpmann.de>, 
 caml-list <caml-list@inria.fr>
References: <1404501528.4384.4.camel@e130>
In-Reply-To: <1404501528.4384.4.camel@e130>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Caml-list] Immutable strings

Hi Gerd,

Thanks for your interesting post.  Your general point about not breaking 
backward compatibility at the source level, as long as only "basic" 
features are used, is important.  Caml is now more than 30 years old (20 
years for OCaml), and it would be very constraining not to prevent 
ourselves from fixing bugs in the language design, including when they 
are about core features.  Some care need to be taken to provide a nice 
story to long-term users and a smooth migration path, using a 
combination of social means (interaction with the community) and 
technical ones (backward compatibility mode, "deprecated" warnings, 
sometimes tools to automate the transition).  Even if we look only at 
industrial adoption, OCaml compete with languages more recently designed 
and if we cannot touch revisit existing choices, the risk is real for 
OCaml to appear "frozen", less modern, and a less compelling choice for 
new projects.  This needs to be balanced against the risk of putting off 
owners of "passive" code bases (on which no dedicated development team 
work on a regular basis, but which need to be marginally modified and 
re-compiled once in a while).

Concerning immutable strings, the migration path seems quite good to me: 
a warning tells you about direct uses of string-mutation features (such 
as String.set), and the default behavior does not break existing code. 
FWIW, it was a matter of hours to go through the entire LexiFi's code 
base to enable the new safe mode, and as always in such operations, it 
was a good opportunity to factorize similar code.  And Jane Street does 
not seem overly worried by the task ( see 
https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).


As one of the problems with the current solution, you mention that 
conversion of strings to bytes and vice versa requires a copy, which 
incurs some performance penalty.  This is true, but the new system can 
also avoid a lot of string copying (in safe mode).  With mutable 
strings, a library which expects strings from the client code and 
depends on those strings to remain the same need to copy them, and 
similarly, it cannot return directly a string from its internal data 
structures, since the client could modify them and thus break internal 
invariants.  (Many libraries don't do such copy, and, in the good cases, 
mention in their documentation that the strings should be treated as 
immutable ones by the caller.  This is clearly a source of possibly 
tricky bugs.)

The biggest problem you mention is related to the fact that in many 
contexts, both mutable and immutable strings could be relevant.  Your 
first idea to address this problem is to consider bytes (mutable 
strings) as a subtype of (immutable) strings.  This only addresses part 
of the problem: a library might still need to copy most strings on its 
boundaries to ensure a proper semantics; not only strings returned by 
the library as you mention, but also some strings passed from the client 
code to the library functions.

Your second idea is to create a common supertype of both string and 
bytes, to be used in contexts which can consume either type.  A minor 
variantiation would be to introduce a third abstract type, with 
"identity" injection from byte and string to it, and to expose the 
"read-only" part of the string API on it.    This can entirely be 
implemented in user-land (although it could benefit from being in the 
stdlib, so that e.g. Lexing could expose a from_stringlike).   Another 
variant of it is to see "stringlike" as a type class, implemented with 
explicit dictionaries.  This could be done with records:

   type 'a stringlike = {
     get: 'a -> int -> char;
     length: 'a -> int;
     sub_string: 'a -> int -> int -> string;
     output: out_channel -> 'a -> unit;
     ...
   }

There would be two constant records "string stringlike" and "bytes 
stringlike", and functions accepting either kind of string would take an 
extra stringlike argument.  (Alternatively, one could use first class 
modules instead of records.)  There is some overhead related to the 
dynamic dispatch, but I'm not convinced this would be unacceptable. 
Your third idea (using char bigarrays) would then fit nicely in this 
approach.

Another direction would be to support also the case of functions which 
can return either a bytes or a string.  A typical case is Bytes.sub / 
Bytes.sub_string.  One could also want Bytes.cat_to_string: bytes -> 
bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes.  For 
those cases, one could introduce a GADT such as:

  type _ is_a_string =
     | String: string is_a_string
     | Bytes: bytes is_a_string
     (* potentially more cases *)

You could then pass the desired constructor to the functions, e.g.: 
Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a.  The cost of 
dispatching on the constructor is tiny, and the stdlib could bypass the 
test altogether using unsafe features internally.  Higher-level 
functions which can return either a string or a bytes are likely to 
produce the result by passing the is_a_string value down to lower-level 
functions.  But one could also imagine that some function behave 
differently according to the actual type of result.  For instance, a 
function which is likely to produce often the same strings could decide 
to keep a (weak) table of already returned strings, or to have a 
hard-coded list of common strings; this works only for immutable 
results, and so the function needs to check the is_a_string constructor 
to enable/disable these optimizations.  The "stringlike" idea could also 
be replaced by this is_a_string GADT, so that there could be a single 
function:

  val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b


All that said, I think the current situation is already a net 
improvement over the previous one, and that further layers can be built 
on top of it, if needed (and not necessarily in stdlib).


Alain


On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> Hi list,
>
> I've just posted a blog article where I criticize the new concept of
> immutable strings that will be available in OCaml 4.02 (as option):
>
> http://blog.camlcity.org/blog/bytes1.html
>
> In short my point is that it the new concept is not far reaching enough,
> and will even have negative impact on the code quality when it is not
> improved. I also present three ideas how to improve it.
>
> Gerd
>