From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1404822242.4384.101.camel@e130>
From: Gerd Stolpmann
To: Alain Frisch
Cc: caml-list
Date: Tue, 08 Jul 2014 14:24:02 +0200
In-Reply-To: <53BA95AC.3050602@frisch.fr>
References: <1404501528.4384.4.camel@e130> <53BA95AC.3050602@frisch.fr>
Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-B0ZIjPNN4eGnTDcB5Miu"
X-Mailer: Evolution 3.10.4-0ubuntu1
Mime-Version: 1.0
Subject: Re: [Caml-list] Immutable strings

--=-B0ZIjPNN4eGnTDcB5Miu
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: quoted-printable

On Monday, 07.07.2014, at 14:42 +0200, Alain Frisch wrote:
> Hi Gerd,
>
> Thanks for your interesting post. Your general point about not breaking
> backward compatibility at the source level, as long as only "basic"
> features are used, is important. ... Even if we look only at
> industrial adoption, OCaml competes with languages more recently designed
> and if we cannot touch or revisit existing choices, the risk is real for
> OCaml to appear "frozen", less modern, and a less compelling choice for
> new projects.
> This needs to be balanced against the risk of putting off
> owners of "passive" code bases (on which no dedicated development team
> works on a regular basis, but which need to be marginally modified and
> re-compiled once in a while).

It will create confusion even with actively maintained code bases. What
could help here is very clear communication about when the change will
become the standard behavior, and how the migration will take place.
Currently, it feels like a big experiment: hey, let's have users
tentatively enable it, and watch out for problems. That's quite naive.
In particular, users hitting problems will probably not try out the
switch (or will immediately revert), because leaving the code base in a
non-buildable state for a longer time is not an option. (And ignoring
these users would not be good, because it is exactly the users who
really do string mutation who could profit most from the change.)

> Concerning immutable strings, the migration path seems quite good to me:
> a warning tells you about direct uses of string-mutation features (such
> as String.set), and the default behavior does not break existing code.

That's good for now, but I'm rather expecting something like this: in
the next OCaml version it is experimental (interfaces may still evolve).
In the following version it is the recommended standard, and a warning
is emitted when -safe-string is not on. In the version after that,
-safe-string becomes the default, etc. Something like that. There could
also be a section in the manual explaining the new behavior, and how to
convert code.

> FWIW, it was a matter of hours to go through the entire LexiFi's code
> base to enable the new safe mode, and as always in such operations, it
> was a good opportunity to factorize similar code. And Jane Street does
> not seem overly worried by the task ( see
> https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).
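
To make concrete what "converting code" means in the simplest case, here
is a minimal sketch. It assumes OCaml 4.03 or later (for
Char.uppercase_ascii) and compilation with -safe-string. The function
name capitalize_first is hypothetical and only serves as illustration.
The pattern is to mutate a private bytes buffer and convert back to
string at the end.

```ocaml
(* Before -safe-string, code like this mutated its argument in place:

     let capitalize_first s =
       if s <> "" then s.[0] <- Char.uppercase s.[0];
       s

   With -safe-string, string has no setter anymore, so the mutation has
   to happen on a bytes value. *)

(* Hypothetical helper: returns a fresh string, the argument is untouched. *)
let capitalize_first (s : string) : string =
  if s = "" then s
  else begin
    let b = Bytes.of_string s in   (* copy: string -> bytes *)
    Bytes.set b 0 (Char.uppercase_ascii (Bytes.get b 0));
    Bytes.to_string b              (* copy: bytes -> string *)
  end

let () = print_endline (capitalize_first "ocaml")
```

Both conversions copy the data, which is the cost discussed further down
in this thread. Bytes.unsafe_to_string avoids the second copy, but it is
only sound when the buffer is never mutated afterwards.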

With my current customer, I don't see any bigger problems either,
because string mutation doesn't play a big role there (it's a compiler
project). I see a big problem with OCamlnet, though, as it is focused on
I/O, and the question of how to deal with buffers is quite central.

> As one of the problems with the current solution, you mention that
> conversion of strings to bytes and vice versa requires a copy, which
> incurs some performance penalty. This is true, but the new system can
> also avoid a lot of string copying (in safe mode). ...
> (Many libraries don't do such a copy, and, in the good cases,
> mention in their documentation that the strings should be treated as
> immutable ones by the caller. This is clearly a source of possibly
> tricky bugs.)

Right, that's the good side of it. (Although the danger is quite
theoretical, as most programmers seem to intuitively follow the rule
"don't mutate strings you did not create". I've never seen this kind of
bug in practice.)

> Your second idea is to create a common supertype of both string and
> bytes, to be used in contexts which can consume either type. A minor
> variation would be to introduce a third abstract type, with
> "identity" injection from bytes and string to it, and to expose the
> "read-only" part of the string API on it. This can entirely be
> implemented in user-land (although it could benefit from being in the
> stdlib, so that e.g. Lexing could expose a from_stringlike).

I think it would be quite important to have that in the stdlib:

 - This sets a standard for interoperability between libraries
 - The stdlib can exploit the details of the representation
 - It would be possible to use stringlike directly in C interfaces

For instance, there is one module in OCamlnet where a regexp is directly
run on an I/O buffer (generally, you need to do some primitive parsing
on I/O buffers before you can extract strings, and that's where
stringlike would be nice to have).
Without stringlike, I would have to replace that regexp somehow.

> Another
> variant of it is to see "stringlike" as a type class, implemented with
> explicit dictionaries. This could be done with records:
>
>   type 'a stringlike = {
>     get: 'a -> int -> char;
>     length: 'a -> int;
>     sub_string: 'a -> int -> int -> string;
>     output: out_channel -> 'a -> unit;
>     ...
>   }
>
> There would be two constant records "string stringlike" and "bytes
> stringlike", and functions accepting either kind of string would take an
> extra stringlike argument. (Alternatively, one could use first class
> modules instead of records.) There is some overhead related to the
> dynamic dispatch, but I'm not convinced this would be unacceptable.

The overhead is quite low. If you need to call e.g. "get" several times,
you could factor out the dictionary lookup:

  let get = stringlike.get in ...

The (only) price is that the access cannot be inlined anymore.

> Your third idea (using char bigarrays) would then fit nicely in this
> approach.

Right, and it would even be possible to use that for other buffer
representations (e.g. I have a non-contiguous buffer type called
Netpagebuffer in OCamlnet that could also be compatible with stringlike;
also think of ring buffers). It's a really nice idea.

The downside is that I cannot imagine any easy way to support this in C
interfaces. Well, you could have

  low_level_buffer : 'a -> (Obj.t * int * int)

that gets you a base address, an offset, and a length, but that could be
too optimistic. Maybe C interfaces should simply check dynamically
whether 'a is a string or a bigarray, and fail otherwise. These dynamic
checks are at least possible (maybe there could be a caml_stringlike_val
function that does its very best).

Another downside of this approach is that it introduces a lot of type
variables.

> Another direction would be to support also the case of functions which
> can return either a bytes or a string.
> A typical case is Bytes.sub /
> Bytes.sub_string. One could also want Bytes.cat_to_string: bytes ->
> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes. For
> those cases, one could introduce a GADT such as:
>
>   type _ is_a_string =
>     | String: string is_a_string
>     | Bytes: bytes is_a_string
>     (* potentially more cases *)
>
> You could then pass the desired constructor to the functions, e.g.:
> Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a. The cost of
> dispatching on the constructor is tiny, and the stdlib could bypass the
> test altogether using unsafe features internally. Higher-level
> functions which can return either a string or a bytes are likely to
> produce the result by passing the is_a_string value down to lower-level
> functions.

That's also a nice idea, and it will definitely save a few string copies
here and there.

> But one could also imagine that some functions behave
> differently according to the actual type of the result. For instance, a
> function which is likely to often produce the same strings could decide
> to keep a (weak) table of already returned strings, or to have a
> hard-coded list of common strings; this works only for immutable
> results, and so the function needs to check the is_a_string constructor
> to enable/disable these optimizations. The "stringlike" idea could also
> be replaced by this is_a_string GADT, so that there could be a single
> function:
>
>   val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
>
>
> All that said, I think the current situation is already a net
> improvement over the previous one,

Well, I wouldn't say so, because I'm missing good migration paths for
some important cases.

> and that further layers can be built
> on top of it, if needed (and not necessarily in stdlib).
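
The two layers discussed above can be sketched together in user-land.
This is only an illustration under stated assumptions, not a proposed
stdlib interface: the record follows the fields quoted above, while the
names string_like, bytes_like and count_char are hypothetical. Unlike
the single "val sub" signature quoted above, this sketch uses the
dictionary for the input side and the GADT only to select the result
type.

```ocaml
(* Read-only "type class" for string-like values, as an explicit
   dictionary. Field names are taken from the record quoted above. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
  output : out_channel -> 'a -> unit;
}

(* The two constant records (hypothetical names). *)
let string_like : string stringlike =
  { get = String.get; length = String.length;
    sub_string = String.sub; output = output_string }

let bytes_like : bytes stringlike =
  { get = Bytes.get; length = Bytes.length;
    sub_string = Bytes.sub_string; output = output_bytes }

(* A consumer that accepts either kind of string. The dictionary lookup
   for "get" is hoisted out of the loop, as suggested above. *)
let count_char like c s =
  let get = like.get in
  let n = ref 0 in
  for i = 0 to like.length s - 1 do
    if get s i = c then incr n
  done;
  !n

(* Selector for the result type, as in the quoted GADT. *)
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Extracts a substring from any string-like input and returns it as
   either string or bytes, with a single copy in both cases. *)
let sub : type a b. a stringlike -> a -> int -> int -> b is_a_string -> b =
  fun like s pos len out ->
    match out with
    | String -> like.sub_string s pos len
    | Bytes ->
      let get = like.get in
      Bytes.init len (fun i -> get s (pos + i))

let () =
  let buf = Bytes.of_string "hello world" in
  print_endline (sub bytes_like buf 0 5 String);
  Printf.printf "%d\n" (count_char string_like 'o' "hello world")
```

A char Bigarray or a ring buffer would only need one more record of the
same type. The extra dictionary argument and the type variable 'a in
every signature are the price mentioned above.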

Well, as pointed out, I'd really like to see one such layer in the
stdlib, because otherwise we'll have five different solutions in the
library scene which are all incompatible with each other. (Your type
class suggestion looks easy and will already solve most of the issues.
Why not just include it in the stdlib? It wouldn't need much: a new
module Stringlike defining it, the records for String and Bytes and
maybe char Bigarrays, and some extensions here and there where it is
used, e.g. in Lexing.)

IMHO, it is important to really provide practical solutions, and not
only to have one in theory.

Gerd

>
> Alain
>
>
>
> On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> > Hi list,
> >
> > I've just posted a blog article where I criticize the new concept of
> > immutable strings that will be available in OCaml 4.02 (as option):
> >
> > http://blog.camlcity.org/blog/bytes1.html
> >
> > In short my point is that the new concept is not far reaching enough,
> > and will even have negative impact on the code quality when it is not
> > improved. I also present three ideas how to improve it.
> >
> > Gerd
> >
>
>

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------

--=-B0ZIjPNN4eGnTDcB5Miu
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAABAgAGBQJTu+LiAAoJEAaM4b9ZLB5T6OYH/3LScai7wPTvLKaj94t+CtmW
JO0tSV5OuytRoZLr7vl9/AGISGVW1pJFdV73K9U6dLFFX3VgWE++yRIKAAOI9Ouf
fT6oQGvevyQBNn+0WoTaUUBbHiDEW7nFYbUMl8mJbdZUwiIarzFKiJiTzTKctJNG
RLJBkLG3BdIwt9YBXBkIzJF7RL/3UZUAKsqdRwEkX+T+rMI0DpLD7cVbdv2nLPGl
H1ul6+7q9frUeVP9R7KsVe0U8v3zxnbE5ehQlO5/LDB+die7NMapQd1ZEovc9bgs
7JX0uBleTtQ6MMTLsH7/AQhE2O/gNY4/x8W6JaWziErcn5FMEUxwzfxR4mCseCw=
=MA7x
-----END PGP SIGNATURE-----

--=-B0ZIjPNN4eGnTDcB5Miu--