Re: [Caml-list] Immutable strings

From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Alain Frisch <alain@frisch.fr>
Cc: caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Immutable strings
Date: Tue, 08 Jul 2014 14:24:02 +0200
Message-ID: <1404822242.4384.101.camel@e130>
In-Reply-To: <53BA95AC.3050602@frisch.fr>

Am Montag, den 07.07.2014, 14:42 +0200 schrieb Alain Frisch:
> Hi Gerd,
> 
> Thanks for your interesting post.  Your general point about not breaking 
> backward compatibility at the source level, as long as only "basic" 
> features are used, is important. ... Even if we look only at
> industrial adoption, OCaml competes with more recently designed
> languages, and if we cannot revisit existing choices, the risk is real
> for OCaml to appear "frozen", less modern, and a less compelling choice
> for new projects.  This needs to be balanced against the risk of putting
> off owners of "passive" code bases (on which no dedicated development
> team works on a regular basis, but which need to be marginally modified
> and re-compiled once in a while).

It will create confusion even with actively maintained code bases. What
could help here is very clear communication about when the change will
become the standard behavior, and how the migration will take place.
Currently, it feels like a big experiment: hey, let users tentatively
enable it, and watch out for problems. That's quite naive. In particular,
users hitting problems will probably not try out the switch (or will
immediately revert), because leaving the code base in a non-buildable
state for a longer time is not an option. (And ignoring these users
would not be good, because it is exactly the users who really do string
mutation who could profit the most from the change.)

> Concerning immutable strings, the migration path seems quite good to me: 
> a warning tells you about direct uses of string-mutation features (such 
> as String.set), and the default behavior does not break existing code. 

That's good for now, but I would rather expect a schedule like this: in
the next OCaml version the feature is experimental (interfaces may still
evolve). In the following version it is the recommended standard, and a
warning is emitted when -safe-string is not on. In the version after
that, -safe-string becomes the default, etc. Something like that. There
could also be a section in the manual explaining the new behavior, and
how to convert code.

> FWIW, it was a matter of hours to go through the entire LexiFi's code 
> base to enable the new safe mode, and as always in such operations, it 
> was a good opportunity to factorize similar code.  And Jane Street does 
> not seem overly worried by the task ( see 
> https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).

With my current customer, I don't see any bigger problems either,
because string mutation doesn't play a big role there (it's a compiler
project). I see a big problem with OCamlnet, though, as it is focused on
I/O, and the issue of how to deal with buffers is quite central.

> As one of the problems with the current solution, you mention that 
> conversion of strings to bytes and vice versa requires a copy, which 
> incurs some performance penalty.  This is true, but the new system can 
> also avoid a lot of string copying (in safe mode).  ...
>  (Many libraries don't do such copy, and, in the good cases, 
> mention in their documentation that the strings should be treated as 
> immutable ones by the caller.  This is clearly a source of possibly 
> tricky bugs.)

Right, that's the good side of it. (Although the danger is quite
theoretical, as most programmers seem to intuitively follow the rule
"don't mutate strings you did not create". I've never seen this kind of
bug in practice.)
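
For readers following along, here is a minimal sketch of the copy cost
under discussion, assuming OCaml 4.03 or later (for
Char.uppercase_ascii). The function name capitalize_first is made up
for illustration. In safe mode a string has to be copied into a bytes
value before it can be mutated, and copied again to get a string back:

```ocaml
(* Hypothetical helper: uppercase the first character of a string.
   With immutable strings the mutation has to happen on a bytes copy. *)
let capitalize_first (s : string) : string =
  if s = "" then s
  else begin
    let b = Bytes.of_string s in   (* first copy: string -> bytes *)
    Bytes.set b 0 (Char.uppercase_ascii (Bytes.get b 0));
    Bytes.to_string b              (* second copy: bytes -> string *)
  end
```

The argument is never modified, so the caller does not have to follow
any "don't mutate" convention; the price is the two copies.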

> Your second idea is to create a common supertype of both string and 
> bytes, to be used in contexts which can consume either type.  A minor 
> variation would be to introduce a third abstract type, with
> "identity" injection from bytes and string to it, and to expose the
> "read-only" part of the string API on it.    This can entirely be 
> implemented in user-land (although it could benefit from being in the 
> stdlib, so that e.g. Lexing could expose a from_stringlike).

I think it would be quite important to have that in the stdlib:

 - This sets a standard for interoperability between libraries
 - The stdlib can exploit the details of the representation
 - It would be possible to use stringlike directly in C interfaces

For instance, there is one module in OCamlnet where a regexp is directly
run on an I/O buffer (generally, you need to do some primitive parsing
on I/O buffers before you can extract strings, and that's where
stringlike would be nice to have). Without stringlike, I would have to
replace that regexp somehow.

>    Another 
> variant of it is to see "stringlike" as a type class, implemented with 
> explicit dictionaries.  This could be done with records:
> 
>    type 'a stringlike = {
>      get: 'a -> int -> char;
>      length: 'a -> int;
>      sub_string: 'a -> int -> int -> string;
>      output: out_channel -> 'a -> unit;
>      ...
>    }
> 
> There would be two constant records "string stringlike" and "bytes 
> stringlike", and functions accepting either kind of string would take an 
> extra stringlike argument.  (Alternatively, one could use first class 
> modules instead of records.)  There is some overhead related to the 
> dynamic dispatch, but I'm not convinced this would be unacceptable. 

The overhead is quite low. If you need to call e.g. "get" several times,
you could factor out the dictionary lookup:

let get = stringlike.get in ...

The (only) price is that the access cannot be inlined anymore.
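
A minimal sketch of how this could look, using only three of the fields
from the record quoted above. The names string_like, bytes_like and
count_char are made up for illustration and are not part of any
existing stdlib module:

```ocaml
(* Explicit dictionary of read-only string operations. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
}

(* The two constant dictionaries. *)
let string_like : string stringlike = {
  get = String.get;
  length = String.length;
  sub_string = String.sub;
}

let bytes_like : bytes stringlike = {
  get = Bytes.get;
  length = Bytes.length;
  sub_string = Bytes.sub_string;
}

(* Count the occurrences of [c] in either kind of string. The field
   lookups are done once, before the loop. *)
let count_char (sl : 'a stringlike) (s : 'a) (c : char) : int =
  let get = sl.get in
  let n = sl.length s in
  let count = ref 0 in
  for i = 0 to n - 1 do
    if get s i = c then incr count
  done;
  !count
```

A caller passes the matching dictionary explicitly, e.g.
count_char string_like "banana" 'a' or
count_char bytes_like (Bytes.of_string "banana") 'n'.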

> Your third idea (using char bigarrays) would then fit nicely in this 
> approach.

Right, and it would even be possible to use that for other buffer
representations (e.g. I have a non-contiguous buffer type called
Netpagebuffer in OCamlnet that could also be compatible with stringlike;
also think of ring buffers). It's a really nice idea.
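
As a sketch of how a char bigarray could be plugged in, assuming OCaml
4.07 or later (where Bigarray is part of the standard library) and the
same record type as quoted above; the names char_bigarray and
bigarray_like are made up for illustration:

```ocaml
(* Same dictionary type as in the quoted proposal, repeated here so the
   example is self-contained. *)
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
}

type char_bigarray =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Dictionary for one-dimensional char bigarrays. sub_string has to
   copy, because the result is an ordinary heap-allocated string. *)
let bigarray_like : char_bigarray stringlike = {
  get = Bigarray.Array1.get;
  length = Bigarray.Array1.dim;
  sub_string =
    (fun a pos len ->
      String.init len (fun i -> Bigarray.Array1.get a (pos + i)));
}
```

A non-contiguous buffer or a ring buffer would provide its own three
functions in the same way.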

The downside is that I cannot imagine any easy way to support this in C
interfaces. Well, you could have

low_level_buffer : 'a -> (Obj.t * int * int)

that gets you a base address, an offset, and a length, but that could be
too optimistic. Maybe C interfaces should simply dynamically check
whether 'a is a string or bigarray, and fail otherwise. These dynamic
checks are at least possible (maybe there could be a caml_stringlike_val
function that does its very best).

Another downside of this approach is that it introduces a lot of type
variables.

> Another direction would be to support also the case of functions which 
> can return either a bytes or a string.  A typical case is Bytes.sub / 
> Bytes.sub_string.  One could also want Bytes.cat_to_string: bytes -> 
> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes.  For 
> those cases, one could introduce a GADT such as:
> 
>   type _ is_a_string =
>      | String: string is_a_string
>      | Bytes: bytes is_a_string
>      (* potentially more cases *)
> 
> You could then pass the desired constructor to the functions, e.g.: 
> Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a.  The cost of 
> dispatching on the constructor is tiny, and the stdlib could bypass the 
> test altogether using unsafe features internally.  Higher-level 
> functions which can return either a string or a bytes are likely to 
> produce the result by passing the is_a_string value down to lower-level 
> functions.

That's also a nice idea, and it will definitely save a few string copies
here and there.
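
A minimal sketch of the quoted signature, written as a free-standing
function rather than as a change to the Bytes module (the stdlib
Bytes.sub does not take this extra argument):

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Hypothetical variant of Bytes.sub: the caller chooses the result
   type by passing the matching constructor, so only one copy of the
   data is made in either case. *)
let sub : type a. bytes -> int -> int -> a is_a_string -> a =
  fun b pos len kind ->
    match kind with
    | String -> Bytes.sub_string b pos len
    | Bytes -> Bytes.sub b pos len
```

With this, sub buf 1 3 String has type string and sub buf 1 3 Bytes has
type bytes, without any conversion afterwards.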

>   But one could also imagine that some functions behave
> differently according to the actual type of result.  For instance, a
> function which is likely to often produce the same strings could decide
> to keep a (weak) table of already returned strings, or to have a 
> hard-coded list of common strings; this works only for immutable 
> results, and so the function needs to check the is_a_string constructor 
> to enable/disable these optimizations.  The "stringlike" idea could also 
> be replaced by this is_a_string GADT, so that there could be a single 
> function:
> 
>   val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
> 
> 
> All that said, I think the current situation is already a net 
> improvement over the previous one, 

Well, I wouldn't say so, because good migration paths are still missing
for some important cases.

> and that further layers can be built 
> on top of it, if needed (and not necessarily in stdlib).

Well, as pointed out, I'd really like to see one such layer in stdlib,
because otherwise we'll have five different solutions in the library
scene which are all incompatible with each other. Your type class
suggestion looks easy and will already solve most of the issues, so why
not just include it in the stdlib? It wouldn't need much: a new module
Stringlike defining it, the records for String and Bytes and maybe char
Bigarrays, and some extensions here and there where it is used, e.g. in
Lexing. IMHO, it is important to really provide practical solutions,
and not only to have one in theory.

Gerd

> 
> Alain
> 
> 
> 
> On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> > Hi list,
> >
> > I've just posted a blog article where I criticize the new concept of
> > immutable strings that will be available in OCaml 4.02 (as an option):
> >
> > http://blog.camlcity.org/blog/bytes1.html
> >
> > In short, my point is that the new concept is not far-reaching enough,
> > and will even have a negative impact on code quality if it is not
> > improved. I also present three ideas for how to improve it.
> >
> > Gerd
> >
> 
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------
