From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <info@gerd-stolpmann.de>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by sympa.inria.fr (Postfix) with ESMTPS id 8A6937FA13
	for <caml-list@sympa.inria.fr>; Tue,  8 Jul 2014 14:24:10 +0200 (CEST)
Received: from mout.kundenserver.de ([212.227.17.24])
  by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 08 Jul 2014 14:24:09 +0200
Received: from office1.lan.sumadev.de (dslb-188-107-170-128.pools.arcor-ip.net [188.107.170.128])
	by mrelayeu.kundenserver.de (node=mreue102) with ESMTP (Nemesis)
	id 0MQert-1XBpCv31hN-00U2n4; Tue, 08 Jul 2014 14:24:08 +0200
Received: from [192.168.0.146] (546BEFE6.cm-12-4d.dynamic.ziggo.nl [84.107.239.230])
	by office1.lan.sumadev.de (Postfix) with ESMTPSA id 13A82DC270;
	Tue,  8 Jul 2014 14:24:08 +0200 (CEST)
Message-ID: <1404822242.4384.101.camel@e130>
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Alain Frisch <alain@frisch.fr>
Cc: caml-list <caml-list@inria.fr>
Date: Tue, 08 Jul 2014 14:24:02 +0200
In-Reply-To: <53BA95AC.3050602@frisch.fr>
References: <1404501528.4384.4.camel@e130> <53BA95AC.3050602@frisch.fr>
Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature";
	boundary="=-B0ZIjPNN4eGnTDcB5Miu"
X-Mailer: Evolution 3.10.4-0ubuntu1 
Mime-Version: 1.0
Subject: Re: [Caml-list] Immutable strings


--=-B0ZIjPNN4eGnTDcB5Miu
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: quoted-printable

On Monday, 07.07.2014, at 14:42 +0200, Alain Frisch wrote:
> Hi Gerd,
>
> Thanks for your interesting post.  Your general point about not breaking
> backward compatibility at the source level, as long as only "basic"
> features are used, is important. ... Even if we look only at industrial
> adoption, OCaml competes with more recently designed languages, and if
> we cannot revisit existing choices, the risk is real for OCaml to appear
> "frozen", less modern, and a less compelling choice for new projects.
> This needs to be balanced against the risk of putting off owners of
> "passive" code bases (on which no dedicated development team works on a
> regular basis, but which need to be marginally modified and re-compiled
> once in a while).

It will create confusion even with actively maintained code bases. What
could help here is very clear communication about when the change will
become the standard behavior, and how the migration will take place.
Currently, it feels like a big experiment: hey, let users tentatively
enable it, and watch out for problems. That's quite naive. In particular,
users hitting problems will probably not try out the switch (or will
immediately revert), because leaving the code base in a non-buildable
state for a longer time is not an option. (And ignoring these users would
not be good, because it's exactly the users who are really doing string
mutation who could profit most from the change.)

> Concerning immutable strings, the migration path seems quite good to me:
> a warning tells you about direct uses of string-mutation features (such
> as String.set), and the default behavior does not break existing code.

That's good for now, but I'd rather expect a schedule like this: in the
next OCaml version it is experimental (interfaces may still evolve). In
the following version it is the recommended standard, and we'll emit a
warning when -safe-string is not on. In the version after that we'll make
-safe-string the default, etc. Something like that. There could also be a
section in the manual explaining the new behavior, and how to convert code.
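
To give an idea of what such a conversion section might show, here is a
rough sketch. It assumes the Bytes module as announced for 4.02;
underscore_spaces is just a made-up example function, not something from
an existing library. Code that used to mutate a string in place now works
on a bytes copy and converts back at the end:

```ocaml
(* Old style, rejected in safe mode: mutating the string s in place
   with s.[i] <- '_'.
   New style: copy into bytes, mutate the copy, convert back. *)
let underscore_spaces (s : string) : string =
  let b = Bytes.of_string s in          (* first copy *)
  for i = 0 to Bytes.length b - 1 do
    if Bytes.get b i = ' ' then Bytes.set b i '_'
  done;
  Bytes.to_string b                     (* second copy *)
```

Note the two copies: this is exactly the overhead that matters for
buffer-heavy code.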

> FWIW, it was a matter of hours to go through LexiFi's entire code
> base to enable the new safe mode, and as always in such operations, it
> was a good opportunity to factorize similar code.  And Jane Street does
> not seem overly worried by the task (see
> https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).

With my current customer, I don't see any major problems either,
because string mutation doesn't play a big role there (it's a compiler
project). I see a big problem with OCamlnet, though, as it is focused on
I/O, and the issue of how to deal with buffers is quite central.

> As one of the problems with the current solution, you mention that
> conversion of strings to bytes and vice versa requires a copy, which
> incurs some performance penalty.  This is true, but the new system can
> also avoid a lot of string copying (in safe mode).  ...
>  (Many libraries don't make such a copy, and, in the good cases,
> mention in their documentation that the strings should be treated as
> immutable ones by the caller.  This is clearly a source of possibly
> tricky bugs.)

Right, that's the good side of it. (Although the danger is quite
theoretical, as most programmers seem to intuitively follow the rule
"don't mutate strings you did not create". I've never seen this kind of
bug in practice.)

> Your second idea is to create a common supertype of both string and
> bytes, to be used in contexts which can consume either type.  A minor
> variation would be to introduce a third abstract type, with
> "identity" injection from bytes and string to it, and to expose the
> "read-only" part of the string API on it.    This can entirely be
> implemented in user-land (although it could benefit from being in the
> stdlib, so that e.g. Lexing could expose a from_stringlike).

I think it would be quite important to have that in the stdlib:

 - This sets a standard for interoperability between libraries
 - The stdlib can exploit the details of the representation
 - It would be possible to use stringlike directly in C interfaces

For instance, there is one module in OCamlnet where a regexp is directly
run on an I/O buffer (generally, you need to do some primitive parsing
on I/O buffers before you can extract strings, and that's where
stringlike would be nice to have). Without stringlike, I would have to
replace that regexp somehow.

>    Another
> variant of it is to see "stringlike" as a type class, implemented with
> explicit dictionaries.  This could be done with records:
>
>    type 'a stringlike = {
>      get: 'a -> int -> char;
>      length: 'a -> int;
>      sub_string: 'a -> int -> int -> string;
>      output: out_channel -> 'a -> unit;
>      ...
>    }
>
> There would be two constant records "string stringlike" and "bytes
> stringlike", and functions accepting either kind of string would take an
> extra stringlike argument.  (Alternatively, one could use first class
> modules instead of records.)  There is some overhead related to the
> dynamic dispatch, but I'm not convinced this would be unacceptable.

The overhead is quite low. If you need to call e.g. "get" several times,
you could factor out the dictionary lookup:

let get = stringlike.get in ...

The (only) price is that the access cannot be inlined anymore.
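
A minimal sketch of how this could look in user-land, following the
record from your mail (without the "output" field and the elided ones).
The names string_like, bytes_like and count_char are only assumptions
for illustration, not a proposed stdlib API:

```ocaml
(* Dictionary of read-only operations over some representation 'a. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
}

(* The two constant dictionaries (hypothetical names). *)
let string_like : string stringlike =
  { get = String.get; length = String.length; sub_string = String.sub }

let bytes_like : bytes stringlike =
  { get = Bytes.get; length = Bytes.length; sub_string = Bytes.sub_string }

(* Count the occurrences of c in any stringlike value.  The dictionary
   lookups are done once, before the loop. *)
let count_char (sl : 'a stringlike) (s : 'a) (c : char) : int =
  let get = sl.get and n = sl.length s in
  let k = ref 0 in
  for i = 0 to n - 1 do
    if get s i = c then incr k
  done;
  !k
```

The same function then works for both representations, e.g.
count_char string_like "banana" 'a' and
count_char bytes_like (Bytes.of_string "banana") 'a'.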

> Your third idea (using char bigarrays) would then fit nicely in this
> approach.

Right, and it would even be possible to use that for other buffer
representations (e.g. I have a non-contiguous buffer type called
Netpagebuffer in OCamlnet that could also be compatible with stringlike;
also think of ring buffers). It's a really nice idea.
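
For example, a ring buffer could get its own dictionary. This is only a
sketch under assumptions: the ring type below is invented for
illustration (it is not Netpagebuffer), and the record repeats the shape
from your mail without the "output" field:

```ocaml
(* Same dictionary shape as in the quoted record, minus "output". *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
}

(* Hypothetical ring buffer: len valid bytes beginning at index start,
   wrapping around the end of data. *)
type ring = { data : Bytes.t; start : int; len : int }

let ring_get r i =
  if i < 0 || i >= r.len then invalid_arg "ring_get";
  Bytes.get r.data ((r.start + i) mod Bytes.length r.data)

let ring_like : ring stringlike = {
  get = ring_get;
  length = (fun r -> r.len);
  sub_string = (fun r pos n ->
    if pos < 0 || n < 0 || pos + n > r.len then invalid_arg "sub_string";
    (* The extracted string is contiguous even if the source is not. *)
    String.init n (fun i -> ring_get r (pos + i)));
}
```

Any function written against 'a stringlike would then accept such a
buffer without knowing that the data wraps around.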

The downside is that I cannot imagine any easy way to support this in C
interfaces. Well, you could have

low_level_buffer : 'a -> (Obj.t * int * int)

that gets you a base address, an offset, and a length, but that could be
too optimistic. Maybe C interfaces should simply dynamically check
whether 'a is a string or bigarray, and fail otherwise. These dynamic
checks are at least possible (maybe there could be a caml_stringlike_val
function that does its very best).

Another downside of this approach is that it introduces a lot of type
variables.

> Another direction would be to support also the case of functions which
> can return either a bytes or a string.  A typical case is Bytes.sub /
> Bytes.sub_string.  One could also want Bytes.cat_to_string: bytes ->
> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes.  For
> those cases, one could introduce a GADT such as:
>
>   type _ is_a_string =
>      | String: string is_a_string
>      | Bytes: bytes is_a_string
>      (* potentially more cases *)
>
> You could then pass the desired constructor to the functions, e.g.:
> Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a.  The cost of
> dispatching on the constructor is tiny, and the stdlib could bypass the
> test altogether using unsafe features internally.  Higher-level
> functions which can return either a string or a bytes are likely to
> produce the result by passing the is_a_string value down to lower-level
> functions.

That's also a nice idea, and it will definitely save a few string copies
here and there.
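
A user-land sketch of this, assuming the GADT exactly as you wrote it.
The function sub below is a hypothetical stand-in for the proposed
Bytes.sub, not an existing stdlib function:

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* The caller selects the result type by passing a constructor.  Either
   branch performs a single copy of the requested range. *)
let sub : type a. bytes -> int -> int -> a is_a_string -> a =
  fun b pos len kind ->
    match kind with
    | String -> Bytes.sub_string b pos len
    | Bytes -> Bytes.sub b pos len
```

With this, sub buf 0 n String yields a string with one copy, whereas
Bytes.to_string (Bytes.sub buf 0 n) would copy the data twice.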

>   But one could also imagine that some functions behave
> differently according to the actual type of result.  For instance, a
> function which is likely to often produce the same strings could decide
> to keep a (weak) table of already returned strings, or to have a
> hard-coded list of common strings; this works only for immutable
> results, and so the function needs to check the is_a_string constructor
> to enable/disable these optimizations.  The "stringlike" idea could also
> be replaced by this is_a_string GADT, so that there could be a single
> function:
>
>   val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
>
>
> All that said, I think the current situation is already a net
> improvement over the previous one,

Well, I wouldn't say so because I'm missing good migration paths for
some important cases.

> and that further layers can be built
> on top of it, if needed (and not necessarily in stdlib).

Well, as pointed out, I'd really like to see one such layer in the stdlib,
because we'll otherwise have five different solutions in the library
scene which are all incompatible with each other. (Your type class
suggestion looks easy and will already solve most of the issues; why not
just include it in the stdlib? It wouldn't need much: a new module
Stringlike defining it, the records for String and Bytes and maybe char
Bigarrays, and some extensions here and there where it is used, e.g. in
Lexing.) IMHO, it is important to really provide practical solutions,
and not only to theoretically have one.

Gerd


>
> Alain
>
>
>
> On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> > Hi list,
> >
> > I've just posted a blog article where I criticize the new concept of
> > immutable strings that will be available in OCaml 4.02 (as option):
> >
> > http://blog.camlcity.org/blog/bytes1.html
> >
> > In short, my point is that the new concept is not far-reaching enough,
> > and will even have a negative impact on code quality if it is not
> > improved. I also present three ideas for how to improve it.
> >
> > Gerd
> >
>
>

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


--=-B0ZIjPNN4eGnTDcB5Miu
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAABAgAGBQJTu+LiAAoJEAaM4b9ZLB5T6OYH/3LScai7wPTvLKaj94t+CtmW
JO0tSV5OuytRoZLr7vl9/AGISGVW1pJFdV73K9U6dLFFX3VgWE++yRIKAAOI9Ouf
fT6oQGvevyQBNn+0WoTaUUBbHiDEW7nFYbUMl8mJbdZUwiIarzFKiJiTzTKctJNG
RLJBkLG3BdIwt9YBXBkIzJF7RL/3UZUAKsqdRwEkX+T+rMI0DpLD7cVbdv2nLPGl
H1ul6+7q9frUeVP9R7KsVe0U8v3zxnbE5ehQlO5/LDB+die7NMapQd1ZEovc9bgs
7JX0uBleTtQ6MMTLsH7/AQhE2O/gNY4/x8W6JaWziErcn5FMEUxwzfxR4mCseCw=
=MA7x
-----END PGP SIGNATURE-----

--=-B0ZIjPNN4eGnTDcB5Miu--