From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 7F0277FA13 for ; Mon, 7 Jul 2014 14:42:23 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of alain@frisch.fr) identity=pra; client-ip=193.252.23.210; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="alain@frisch.fr"; x-sender="alain@frisch.fr"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of alain@frisch.fr) identity=mailfrom; client-ip=193.252.23.210; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="alain@frisch.fr"; x-sender="alain@frisch.fr"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@msa.smtpout.orange.fr) identity=helo; client-ip=193.252.23.210; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="alain@frisch.fr"; x-sender="postmaster@msa.smtpout.orange.fr"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AiABAK6UulPB/BfSnGdsb2JhbABRCYNgxxMBgSoPAQEBAQEICwkJFCiEAwEBBScRLRMRCw4KCRYPCQMCAQIBRQYBDAgBARAHiCsJyUoXjQSBQyg6hEMFmnaBSIVHkEJq X-IPAS-Result: AiABAK6UulPB/BfSnGdsb2JhbABRCYNgxxMBgSoPAQEBAQEICwkJFCiEAwEBBScRLRMRCw4KCRYPCQMCAQIBRQYBDAgBARAHiCsJyUoXjQSBQyg6hEMFmnaBSIVHkEJq X-IronPort-AV: E=Sophos;i="5.01,618,1400018400"; d="scan'208";a="70410214" Received: from msa01.smtpout.orange.fr (HELO msa.smtpout.orange.fr) ([193.252.23.210]) by mail3-smtp-sop.national.inria.fr with ESMTP; 07 Jul 2014 14:42:22 +0200 Received: from [192.168.1.133] ([92.151.60.153]) by mwinf5d54 with ME id PQiM1o0053JNEic03QiMuY; Mon, 07 Jul 2014 14:42:21 +0200 X-ME-Helo: [192.168.1.133] X-ME-Auth: bGV4aWZpQHdhbmFkb28uZnI= X-ME-Date: Mon, 07 Jul 2014 14:42:21 +0200 X-ME-IP: 92.151.60.153 Message-ID: <53BA95AC.3050602@frisch.fr> Date: Mon, 07 Jul 2014 14:42:20 +0200 From: Alain Frisch User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Gerd Stolpmann , caml-list References: <1404501528.4384.4.camel@e130> In-Reply-To: <1404501528.4384.4.camel@e130> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Caml-list] Immutable strings Hi Gerd, Thanks for your interesting post. Your general point about not breaking backward compatibility at the source level, as long as only "basic" features are used, is important. Caml is now more than 30 years old (20 years for OCaml), and it would be very constraining not to prevent ourselves from fixing bugs in the language design, including when they are about core features. Some care need to be taken to provide a nice story to long-term users and a smooth migration path, using a combination of social means (interaction with the community) and technical ones (backward compatibility mode, "deprecated" warnings, sometimes tools to automate the transition). Even if we look only at industrial adoption, OCaml compete with languages more recently designed and if we cannot touch revisit existing choices, the risk is real for OCaml to appear "frozen", less modern, and a less compelling choice for new projects. This needs to be balanced against the risk of putting off owners of "passive" code bases (on which no dedicated development team work on a regular basis, but which need to be marginally modified and re-compiled once in a while). Concerning immutable strings, the migration path seems quite good to me: a warning tells you about direct uses of string-mutation features (such as String.set), and the default behavior does not break existing code. FWIW, it was a matter of hours to go through the entire LexiFi's code base to enable the new safe mode, and as always in such operations, it was a good opportunity to factorize similar code. And Jane Street does not seem overly worried by the task ( see https://blogs.janestreet.com/ocaml-4-02-everything-else/ ). As one of the problems with the current solution, you mention that conversion of strings to bytes and vice versa requires a copy, which incurs some performance penalty. This is true, but the new system can also avoid a lot of string copying (in safe mode). With mutable strings, a library which expects strings from the client code and depends on those strings to remain the same need to copy them, and similarly, it cannot return directly a string from its internal data structures, since the client could modify them and thus break internal invariants. (Many libraries don't do such copy, and, in the good cases, mention in their documentation that the strings should be treated as immutable ones by the caller. This is clearly a source of possibly tricky bugs.) The biggest problem you mention is related to the fact that in many contexts, both mutable and immutable strings could be relevant. Your first idea to address this problem is to consider bytes (mutable strings) as a subtype of (immutable) strings. This only addresses part of the problem: a library might still need to copy most strings on its boundaries to ensure a proper semantics; not only strings returned by the library as you mention, but also some strings passed from the client code to the library functions. Your second idea is to create a common supertype of both string and bytes, to be used in contexts which can consume either type. A minor variantiation would be to introduce a third abstract type, with "identity" injection from byte and string to it, and to expose the "read-only" part of the string API on it. This can entirely be implemented in user-land (although it could benefit from being in the stdlib, so that e.g. Lexing could expose a from_stringlike). Another variant of it is to see "stringlike" as a type class, implemented with explicit dictionaries. This could be done with records: type 'a stringlike = { get: 'a -> int -> char; length: 'a -> int; sub_string: 'a -> int -> int -> string; output: out_channel -> 'a -> unit; ... } There would be two constant records "string stringlike" and "bytes stringlike", and functions accepting either kind of string would take an extra stringlike argument. (Alternatively, one could use first class modules instead of records.) There is some overhead related to the dynamic dispatch, but I'm not convinced this would be unacceptable. Your third idea (using char bigarrays) would then fit nicely in this approach. Another direction would be to support also the case of functions which can return either a bytes or a string. A typical case is Bytes.sub / Bytes.sub_string. One could also want Bytes.cat_to_string: bytes -> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes. For those cases, one could introduce a GADT such as: type _ is_a_string = | String: string is_a_string | Bytes: bytes is_a_string (* potentially more cases *) You could then pass the desired constructor to the functions, e.g.: Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a. The cost of dispatching on the constructor is tiny, and the stdlib could bypass the test altogether using unsafe features internally. Higher-level functions which can return either a string or a bytes are likely to produce the result by passing the is_a_string value down to lower-level functions. But one could also imagine that some function behave differently according to the actual type of result. For instance, a function which is likely to produce often the same strings could decide to keep a (weak) table of already returned strings, or to have a hard-coded list of common strings; this works only for immutable results, and so the function needs to check the is_a_string constructor to enable/disable these optimizations. The "stringlike" idea could also be replaced by this is_a_string GADT, so that there could be a single function: val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b All that said, I think the current situation is already a net improvement over the previous one, and that further layers can be built on top of it, if needed (and not necessarily in stdlib). Alain On 07/04/2014 09:18 PM, Gerd Stolpmann wrote: > Hi list, > > I've just posted a blog article where I criticize the new concept of > immutable strings that will be available in OCaml 4.02 (as option): > > http://blog.camlcity.org/blog/bytes1.html > > In short my point is that it the new concept is not far reaching enough, > and will even have negative impact on the code quality when it is not > improved. I also present three ideas how to improve it. > > Gerd >