caml-list - the Caml user's mailing list
* [Caml-list] Immutable strings
@ 2014-07-04 19:18 Gerd Stolpmann
  2014-07-04 20:31 ` Anthony Tavener
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Gerd Stolpmann @ 2014-07-04 19:18 UTC (permalink / raw)
  To: caml-list

Hi list,

I've just posted a blog article where I criticize the new concept of
immutable strings that will be available in OCaml 4.02 (as an option):

http://blog.camlcity.org/blog/bytes1.html

In short, my point is that the new concept is not far-reaching enough,
and will even have a negative impact on code quality if it is not
improved. I also present three ideas for improving it.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------




* Re: [Caml-list] Immutable strings
  2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
@ 2014-07-04 20:31 ` Anthony Tavener
  2014-07-04 20:38   ` Malcolm Matalka
                     ` (2 more replies)
  2014-07-04 21:01 ` Markus Mottl
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 29+ messages in thread
From: Anthony Tavener @ 2014-07-04 20:31 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list


I'm rather welcoming of the immutable change (hehe) of strings, but I haven't
considered these details -- perhaps because I only use strings as immutable
(currently with no such guarantee!), and use bigarray for a block of mutable
bytes... which is your idea #3.

It seems the "bytes" type would be most useful in cases where mutable and
immutable strings are used in a mixed manner... but given these practical
issues you raise, it could be less pleasant than it first appears. Your
"stringlike" solution seems reasonable, but I don't have a good use-case in
mind for mixed mutable/immutable to help me imagine the result. What are some
scenarios where this mix of types is desired? I think even Rust doesn't
support mutable strings -- which seems bold for its target audience, yet
they're fine with it?

When I consider possible scenarios of utf8 encoded strings, and mutating that
in-place... ugh. Even "back in the day", doing string operations in C on
ASCII, I'd favor building a new string rather than flirting with saving ops by
overwriting values in the current string. Oh! Upper/lower-case! Maybe that's
the one good use-case. ;)




* Re: [Caml-list] Immutable strings
  2014-07-04 20:31 ` Anthony Tavener
@ 2014-07-04 20:38   ` Malcolm Matalka
  2014-07-04 23:44   ` Daniel Bünzli
  2014-07-05 11:04   ` Gerd Stolpmann
  2 siblings, 0 replies; 29+ messages in thread
From: Malcolm Matalka @ 2014-07-04 20:38 UTC (permalink / raw)
  To: Anthony Tavener; +Cc: Gerd Stolpmann, caml-list

I haven't really been following this but I'm curious why a new type,
rstring, was not introduced?

As for the actual impact on the community, this seems like a
question the OPAM team can answer now, right?  They could compile every
package with immutable strings turned on and see how many fail.  That
would give an idea of the impact and possibly suggest a migration path
or an alternative approach.

/M


* Re: [Caml-list] Immutable strings
  2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
  2014-07-04 20:31 ` Anthony Tavener
@ 2014-07-04 21:01 ` Markus Mottl
  2014-07-05 11:24   ` Gerd Stolpmann
  2014-07-28 11:14   ` Goswin von Brederlow
  2014-07-07 12:42 ` Alain Frisch
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Markus Mottl @ 2014-07-04 21:01 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

I agree that the new concept has some noteworthy downsides as
demonstrated in the Lexing-example.  Your proposed solution 2
(stringlike) would probably solve these issues from a safety point of
view.  The downside is that the complexity of string-handling would
increase even more, because then we would have three types to deal
with.  I personally prefer safety over convenience, but other people's
(especially beginners') mileage may vary.

The Bigarray-approach doesn't seem appealing to me.  Strings are much
more lightweight, since they can be allocated cheaply on the
OCaml-heap.  E.g. String.create is about 10x-100x faster than
Bigarray.create.  That seems too big to ignore.

Regards,
Markus




-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


* Re: [Caml-list] Immutable strings
  2014-07-04 20:31 ` Anthony Tavener
  2014-07-04 20:38   ` Malcolm Matalka
@ 2014-07-04 23:44   ` Daniel Bünzli
  2014-07-05 11:04   ` Gerd Stolpmann
  2 siblings, 0 replies; 29+ messages in thread
From: Daniel Bünzli @ 2014-07-04 23:44 UTC (permalink / raw)
  To: Anthony Tavener; +Cc: Gerd Stolpmann, caml-list



On Friday, 4 July 2014 at 21:31, Anthony Tavener wrote:

> When I consider possible scenarios of utf8 encoded strings, and mutating that
> in-place... ugh. Even "back in the day", doing string operations in C on
> ASCII, I'd favor building a new string rather than flirting with saving ops by
> overwriting values in the current string. Oh! Upper/lower-case! Maybe that's
> the one good use-case. ;)

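Not even… that is, if you care about Unicode. Upper-casing can change the
number of characters; e.g. the ligature U+FB01 maps to the two characters
U+0046 U+0049 ("FI"):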

# Uucp.Case.Map.to_upper 0xFB01;;
- : [ `Self | `Uchars of Uucp.uchar list ] = `Uchars [U+0046; U+0049]



Best,

Daniel




* Re: [Caml-list] Immutable strings
  2014-07-04 20:31 ` Anthony Tavener
  2014-07-04 20:38   ` Malcolm Matalka
  2014-07-04 23:44   ` Daniel Bünzli
@ 2014-07-05 11:04   ` Gerd Stolpmann
  2014-07-16 11:38     ` Damien Doligez
  2 siblings, 1 reply; 29+ messages in thread
From: Gerd Stolpmann @ 2014-07-05 11:04 UTC (permalink / raw)
  To: Anthony Tavener; +Cc: caml-list

On Friday, 04.07.2014, at 14:31 -0600, Anthony Tavener wrote:
> It seems the "bytes" type would be most useful in cases where mutable and
> immutable strings are used in a mixed manner... but given these practical
> issues you raise, it could be less pleasant than it first appears. Your
> "stringlike" solution seems reasonable, but I don't have a good use-case in
> mind for mixed mutable/immutable to help me imagine the result. What are
> some scenarios where this mix of types is desired? I think even Rust
> doesn't support mutable strings -- which seems bold for its target
> audience, yet they're fine with it?

I mostly have buffers in mind, as you need them for block-by-block I/O.
Actually, I started thinking about this issue when looking again at
OCamlnet, and how I could use "bytes" there. It's a hard case: lots of
buffers of different types, and you really run into the problems I
sketched in the article, as it is a common operation to copy the
contents of one buffer into another.

That's also why I'm suggesting bigarrays - for interfacing with C
these are much easier to use as buffers, as bigarrays are just malloc'ed
memory and cannot be moved around by the GC. (And the C interface is
needed for I/O.)

So my scenario is quite low-level: I/O, and C interfaces.

> When I consider possible scenarios of utf8 encoded strings,

No, that's a no-go, of course. When it comes to real text, mutability
doesn't give you much.

Gerd


> and mutating that in-place... ugh. Even "back in the day", doing string
> operations in C on ASCII, I'd favor building a new string rather than
> flirting with saving ops by overwriting values in the current string. Oh!
> Upper/lower-case! Maybe that's the one good use-case. ;)

* Re: [Caml-list] Immutable strings
  2014-07-04 21:01 ` Markus Mottl
@ 2014-07-05 11:24   ` Gerd Stolpmann
  2014-07-08 13:23     ` Jacques Garrigue
  2014-07-28 11:14   ` Goswin von Brederlow
  1 sibling, 1 reply; 29+ messages in thread
From: Gerd Stolpmann @ 2014-07-05 11:24 UTC (permalink / raw)
  To: Markus Mottl; +Cc: caml-list

On Friday, 04.07.2014, at 17:01 -0400, Markus Mottl wrote:
> I agree that the new concept has some noteworthy downsides as
> demonstrated in the Lexing-example.  Your proposed solution 2
> (stringlike) would probably solve these issues from a safety point of
> view.  The downside is that the complexity of string-handling would
> increase even more, because then we would have three types to deal
> with.  I personally prefer safety over convenience, but other people's
> (especially beginner's) mileage may vary.

Well, the complexity can be reduced a bit by using phantom types:

type string = [`String] stringlike
type bytes = [`Bytes] stringlike

and then just define function-by-function what is permitted:

val get : 'a stringlike -> int -> char
val set : [`Bytes] stringlike -> int -> char -> unit
val sub : 'a stringlike -> int -> int -> [`String] stringlike
val sub_bytes : 'a stringlike -> int -> int -> [`Bytes] stringlike

etc., and the modules String and Bytes would just contain aliases of
these functions with monomorphic types.

I don't know, though, whether we can be sure never to see the
polymorphic types when just using string and bytes. It would be a bit
surprising for beginners to see them, and you would sometimes have to
deal with unresolved type variables.
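
A minimal sketch of how this could look, assuming a hypothetical module
named Stringlike that represents both variants as bytes internally. None of
these names exist in the standard library; of_string, create, length and
to_string are added here only to make the example usable:

```ocaml
(* Sketch only: Stringlike and all of its functions are hypothetical. *)
module Stringlike : sig
  type 'a t                      (* 'a is a phantom tag *)
  type str = [ `String ] t       (* read-only view *)
  type byt = [ `Bytes ] t        (* mutable view *)

  val of_string : string -> str
  val create : int -> char -> byt
  val length : 'a t -> int
  val get : 'a t -> int -> char
  val set : byt -> int -> char -> unit
  val sub : 'a t -> int -> int -> str
  val sub_bytes : 'a t -> int -> int -> byt
  val to_string : 'a t -> string
end = struct
  type 'a t = Bytes.t
  type str = [ `String ] t
  type byt = [ `Bytes ] t

  let of_string = Bytes.of_string   (* copies, so the result is private *)
  let create = Bytes.make
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
  let sub = Bytes.sub               (* fresh copy, tagged read-only *)
  let sub_bytes = Bytes.sub         (* fresh copy, tagged mutable *)
  let to_string = Bytes.to_string
end

let () =
  let b = Stringlike.create 5 'a' in
  Stringlike.set b 0 'X';
  let s = Stringlike.sub b 0 3 in
  (* Stringlike.set s 0 'y' would not type-check: s is tagged `String. *)
  print_endline (Stringlike.to_string s)   (* prints "Xaa" *)
```

Because the type is abstract in the signature, the only way to get a
mutable value out of a read-only one is through a copying function such as
sub_bytes.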

> The Bigarray-approach doesn't seem appealing to me.  Strings are much
> more lightweight, since they can be allocated cheaply on the
> OCaml-heap.  E.g. String.create is about 10x-100x faster than
> Bigarray.create.  That seems too big to ignore.

Oh, we already ignore that Unix.read and Unix.write copy all data
through an additional buffer, because we cannot pass an OCaml string
directly to the OS while another thread could relocate this string. With
bigarrays, that copy would be eliminated. So I'd guess you are normally
even faster with bigarrays, at least when you only look at their use as
I/O buffers. But there might be other uses where this is different.

Gerd




* Re: [Caml-list] Immutable strings
  2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
  2014-07-04 20:31 ` Anthony Tavener
  2014-07-04 21:01 ` Markus Mottl
@ 2014-07-07 12:42 ` Alain Frisch
  2014-07-08 12:24   ` Gerd Stolpmann
  2014-07-08 18:15 ` mattiasw
  2014-07-21 15:06 ` Alain Frisch
  4 siblings, 1 reply; 29+ messages in thread
From: Alain Frisch @ 2014-07-07 12:42 UTC (permalink / raw)
  To: Gerd Stolpmann, caml-list

Hi Gerd,

Thanks for your interesting post.  Your general point about not breaking
backward compatibility at the source level, as long as only "basic"
features are used, is important.  Caml is now more than 30 years old (20
years for OCaml), and it would be very constraining to prevent
ourselves from fixing bugs in the language design, including when they
are about core features.  Some care needs to be taken to provide a nice
story to long-term users and a smooth migration path, using a
combination of social means (interaction with the community) and
technical ones (backward compatibility mode, "deprecated" warnings,
sometimes tools to automate the transition).  Even if we look only at
industrial adoption, OCaml competes with more recently designed languages,
and if we cannot revisit existing choices, the risk is real for
OCaml to appear "frozen", less modern, and a less compelling choice for
new projects.  This needs to be balanced against the risk of putting off
owners of "passive" code bases (on which no dedicated development team
works on a regular basis, but which need to be marginally modified and
re-compiled once in a while).

Concerning immutable strings, the migration path seems quite good to me:
a warning tells you about direct uses of string-mutation features (such
as String.set), and the default behavior does not break existing code.
FWIW, it was a matter of hours to go through LexiFi's entire code
base to enable the new safe mode, and as always in such operations, it
was a good opportunity to factorize similar code.  And Jane Street does
not seem overly worried by the task (see
https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).


As one of the problems with the current solution, you mention that
conversion of strings to bytes and vice versa requires a copy, which
incurs some performance penalty.  This is true, but the new system can
also avoid a lot of string copying (in safe mode).  With mutable
strings, a library which expects strings from the client code and
depends on those strings remaining the same needs to copy them, and
similarly, it cannot directly return a string from its internal data
structures, since the client could modify it and thus break internal
invariants.  (Many libraries don't make such copies, and, in the good
cases, mention in their documentation that the strings should be treated
as immutable ones by the caller.  This is clearly a source of possibly
tricky bugs.)
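
To illustrate the copying argument, here is a small sketch. The two
registry modules are hypothetical and only model a library that stores a
name handed in by the client; with the mutable type a defensive copy is
needed, with the immutable one it is not:

```ocaml
(* Hypothetical library modules; the names are made up for illustration. *)
module Registry_mutable = struct
  let names : Bytes.t list ref = ref []

  (* The caller keeps a reference to [name] and may overwrite it later,
     so the library has to store a private copy. *)
  let register (name : Bytes.t) = names := Bytes.copy name :: !names
end

module Registry_immutable = struct
  let names : string list ref = ref []

  (* In safe mode the caller cannot change [name]: no copy is needed. *)
  let register (name : string) = names := name :: !names
end

let () =
  let key = Bytes.of_string "alpha" in
  Registry_mutable.register key;
  Bytes.set key 0 'X';               (* does not affect the stored copy *)
  Registry_immutable.register "beta"
```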

The biggest problem you mention is related to the fact that in many 
contexts, both mutable and immutable strings could be relevant.  Your 
first idea to address this problem is to consider bytes (mutable 
strings) as a subtype of (immutable) strings.  This only addresses part 
of the problem: a library might still need to copy most strings on its 
boundaries to ensure a proper semantics; not only strings returned by 
the library as you mention, but also some strings passed from the client 
code to the library functions.

Your second idea is to create a common supertype of both string and
bytes, to be used in contexts which can consume either type.  A minor
variation would be to introduce a third abstract type, with
"identity" injections from bytes and string to it, and to expose the
"read-only" part of the string API on it.  This can be implemented
entirely in user-land (although it could benefit from being in the
stdlib, so that e.g. Lexing could expose a from_stringlike).  Another
variant of it is to see "stringlike" as a type class, implemented with
explicit dictionaries.  This could be done with records:

   type 'a stringlike = {
     get: 'a -> int -> char;
     length: 'a -> int;
     sub_string: 'a -> int -> int -> string;
     output: out_channel -> 'a -> unit;
     ...
   }

There would be two constant records "string stringlike" and "bytes 
stringlike", and functions accepting either kind of string would take an 
extra stringlike argument.  (Alternatively, one could use first class 
modules instead of records.)  There is some overhead related to the 
dynamic dispatch, but I'm not convinced this would be unacceptable. 
Your third idea (using char bigarrays) would then fit nicely in this 
approach.
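
A minimal sketch of the dictionary-passing variant, assuming the record is
cut down to the four fields named above. The names string_like, bytes_like
and index_of are hypothetical and chosen only for illustration:

```ocaml
type 'a stringlike = {
  get : 'a -> int -> char;
  length : 'a -> int;
  sub_string : 'a -> int -> int -> string;
  output : out_channel -> 'a -> unit;
}

(* The two constant dictionaries. *)
let string_like : string stringlike = {
  get = String.get;
  length = String.length;
  sub_string = String.sub;
  output = output_string;
}

let bytes_like : bytes stringlike = {
  get = Bytes.get;
  length = Bytes.length;
  sub_string = Bytes.sub_string;
  output = output_bytes;
}

(* Works on either representation: index of the first [c], if any. *)
let index_of (d : 'a stringlike) (s : 'a) (c : char) : int option =
  let n = d.length s in
  let rec go i =
    if i >= n then None
    else if d.get s i = c then Some i
    else go (i + 1)
  in
  go 0
```

A caller would write index_of string_like "hello" 'l' or
index_of bytes_like buf 'l'; a dictionary for char bigarrays could be added
in the same way without touching index_of.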

Another direction would be to support also the case of functions which 
can return either a bytes or a string.  A typical case is Bytes.sub / 
Bytes.sub_string.  One could also want Bytes.cat_to_string: bytes -> 
bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes.  For 
those cases, one could introduce a GADT such as:

  type _ is_a_string =
     | String: string is_a_string
     | Bytes: bytes is_a_string
     (* potentially more cases *)

You could then pass the desired constructor to the functions, e.g.: 
Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a.  The cost of 
dispatching on the constructor is tiny, and the stdlib could bypass the 
test altogether using unsafe features internally.  Higher-level
functions which can return either a string or a bytes are likely to
produce the result by passing the is_a_string value down to lower-level
functions.  But one could also imagine that some functions behave
differently according to the actual type of the result.  For instance, a
function which is likely to produce the same strings often could decide
to keep a (weak) table of already returned strings, or to have a
hard-coded list of common strings; this works only for immutable
results, and so the function needs to check the is_a_string constructor
to enable/disable these optimizations.  The "stringlike" idea could also
be replaced by this is_a_string GADT, so that there could be a single
function:

  val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
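
A sketch of both signatures using only safe stdlib functions (so without
the internal shortcuts mentioned above). The name sub_bytes for the first
function is an assumption made here to keep the two apart:

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Bytes.sub with the result type chosen by the caller. *)
let sub_bytes : type a. bytes -> int -> int -> a is_a_string -> a =
 fun b pos len kind ->
  match kind with
  | String -> Bytes.sub_string b pos len
  | Bytes -> Bytes.sub b pos len

(* The single general function: source and result kinds are both passed. *)
let sub : type a b. a is_a_string -> a -> int -> int -> b is_a_string -> b =
 fun src s pos len dst ->
  match src, dst with
  | String, String -> String.sub s pos len
  | String, Bytes -> Bytes.of_string (String.sub s pos len)  (* two copies *)
  | Bytes, String -> Bytes.sub_string s pos len
  | Bytes, Bytes -> Bytes.sub s pos len
```

For example, sub String "hello" 0 2 Bytes returns a fresh bytes value, and
sub Bytes buf 0 2 String returns a string.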


All that said, I think the current situation is already a net 
improvement over the previous one, and that further layers can be built 
on top of it, if needed (and not necessarily in stdlib).


Alain




* Re: [Caml-list] Immutable strings
  2014-07-07 12:42 ` Alain Frisch
@ 2014-07-08 12:24   ` Gerd Stolpmann
  2014-07-09 13:54     ` Alain Frisch
  0 siblings, 1 reply; 29+ messages in thread
From: Gerd Stolpmann @ 2014-07-08 12:24 UTC (permalink / raw)
  To: Alain Frisch; +Cc: caml-list

On Monday, 07.07.2014, at 14:42 +0200, Alain Frisch wrote:
> Hi Gerd,
> 
> Thanks for your interesting post.  Your general point about not breaking 
> backward compatibility at the source level, as long as only "basic" 
> features are used, is important. ... Even if we look only at
> industrial adoption, OCaml competes with more recently designed languages,
> and if we cannot revisit existing choices, the risk is real for
> OCaml to appear "frozen", less modern, and a less compelling choice for
> new projects.  This needs to be balanced against the risk of putting off
> owners of "passive" code bases (on which no dedicated development team
> works on a regular basis, but which need to be marginally modified and
> re-compiled once in a while).

It will create confusion even with actively maintained code bases. What
could help here is very clear communication about when the change will
become the standard behavior, and how the migration will take place.
Currently, it feels like a big experiment - hey, let users tentatively
enable it, and watch out for problems. That's quite naive. In particular,
users hitting problems will probably not try out the switch (or will
immediately revert), because leaving the code base in a non-buildable
state for a longer time is not an option. (And ignoring these users would
not be good, because it's exactly these users, the ones who are really
doing string mutation, who could profit the most from the change.)

> Concerning immutable strings, the migration path seems quite good to me: 
> a warning tells you about direct uses of string-mutation features (such 
> as String.set), and the default behavior does not break existing code. 

That's good for now, but I was expecting something more like this: in the
next OCaml version it is experimental (interfaces may still evolve). In the
following version it is the recommended standard, and we'll emit a warning
when -safe-string is not on. In the version after that we'll make
-safe-string the default, etc. Something like that. There could also be a
section in the manual explaining the new behavior, and how to convert code.

> FWIW, it was a matter of hours to go through the entire LexiFi's code 
> base to enable the new safe mode, and as always in such operations, it 
> was a good opportunity to factorize similar code.  And Jane Street does 
> not seem overly worried by the task ( see 
> https://blogs.janestreet.com/ocaml-4-02-everything-else/ ).

With my current customer, I don't see any bigger problems either,
because string mutation doesn't play a big role there (it's a compiler
project). I see a big problem with OCamlnet, though, as it is focused on
I/O, and the issue of how to deal with buffers is quite central.

> As one of the problems with the current solution, you mention that 
> conversion of strings to bytes and vice versa requires a copy, which 
> incurs some performance penalty.  This is true, but the new system can 
> also avoid a lot of string copying (in safe mode).  ...
>  (Many libraries don't do such copy, and, in the good cases, 
> mention in their documentation that the strings should be treated as 
> immutable ones by the caller.  This is clearly a source of possibly 
> tricky bugs.)

Right, that's the good side of it. (Although the danger is quite
theoretical, as most programmers seem to intuitively follow the rule
"don't mutate strings you did not create". I've never seen this kind of
bug in practice.)

> Your second idea is to create a common supertype of both string and
> bytes, to be used in contexts which can consume either type.  A minor
> variation would be to introduce a third abstract type, with
> "identity" injections from bytes and string to it, and to expose the
> "read-only" part of the string API on it.  This can be implemented
> entirely in user-land (although it could benefit from being in the
> stdlib, so that e.g. Lexing could expose a from_stringlike).

I think it would be quite important to have that in the stdlib:

 - This sets a standard for interoperability between libraries
 - The stdlib can exploit the details of the representation
 - It would be possible to use stringlike directly in C interfaces

For instance, there is one module in OCamlnet where a regexp is directly
run on an I/O buffer (generally, you need to do some primitive parsing
on I/O buffers before you can extract strings, and that's where
stringlike would be nice to have). Without stringlike, I would have to
replace that regexp somehow.

>    Another 
> variant of it is to see "stringlike" as a type class, implemented with 
> explicit dictionaries.  This could be done with records:
> 
>    type 'a stringlike = {
>      get: 'a -> int -> char;
>      length: 'a -> int;
>      sub_string: 'a -> int -> int -> string;
>      output: out_channel -> 'a -> unit;
>      ...
>    }
> 
> There would be two constant records "string stringlike" and "bytes 
> stringlike", and functions accepting either kind of string would take an 
> extra stringlike argument.  (Alternatively, one could use first class 
> modules instead of records.)  There is some overhead related to the 
> dynamic dispatch, but I'm not convinced this would be unacceptable. 

The overhead is quite low. If you need to call e.g. "get" several times,
you could factor out the dictionary lookup:

let get = stringlike.get in ...

The (only) price is that the access cannot be inlined anymore.
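
As a sketch of that factoring, assuming a reduced two-field version of the
record and a hypothetical function count_char:

```ocaml
(* Reduced, hypothetical version of the dictionary record quoted above. *)
type 'a stringlike = { get : 'a -> int -> char; length : 'a -> int }

let count_char (d : 'a stringlike) (s : 'a) (c : char) : int =
  (* Fetch the closures once, outside the loop. *)
  let get = d.get and length = d.length in
  let n = ref 0 in
  for i = 0 to length s - 1 do
    if get s i = c then incr n
  done;
  !n
```

Each call to get inside the loop is still an indirect call through a
closure, which is the inlining cost mentioned above.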

> Your third idea (using char bigarrays) would then fit nicely in this 
> approach.

Right, and it would even be possible to use that for other buffer
representations (e.g. I have a non-contiguous buffer type called
Netpagebuffer in OCamlnet that could also be compatible with stringlike;
also think of ring buffers). It's a really nice idea.

The downside is that I cannot imagine any easy way to support this in C
interfaces. Well, you could have

low_level_buffer : 'a -> (Obj.t * int * int)

that gets you a base address, an offset, and a length, but that could be
too optimistic. Maybe C interfaces should simply dynamically check
whether 'a is a string or bigarray, and fail otherwise. These dynamic
checks are at least possible (maybe there could be a caml_stringlike_val
function that does its very best).

Another downside of this approach is that it introduces a lot of type
variables.

> Another direction would be to support also the case of functions which 
> can return either a bytes or a string.  A typical case is Bytes.sub / 
> Bytes.sub_string.  One could also want Bytes.cat_to_string: bytes -> 
> bytes -> string in addition to Bytes.cat: bytes -> bytes -> bytes.  For 
> those cases, one could introduce a GADT such as:
> 
>   type _ is_a_string =
>      | String: string is_a_string
>      | Bytes: bytes is_a_string
>      (* potentially more cases *)
> 
> You could then pass the desired constructor to the functions, e.g.: 
> Bytes.sub: bytes -> int -> int -> 'a is_a_string -> 'a.  The cost of 
> dispatching on the constructor is tiny, and the stdlib could bypass the 
> test altogether using unsafe features internally.  Higher-level 
> functions which can return either a string or a bytes are likely to 
> produce the result by passing the is_a_string value down to lower-level 
> functions.

That's also a nice idea, and it will definitely save a few string copies
here and there.
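
A minimal sketch of how such a witness-driven function could look,
assuming the GADT exactly as quoted; sub_as is a hypothetical name (the
quoted proposal would put this into Bytes.sub itself), and this version
uses only safe stdlib calls rather than the unsafe shortcuts the stdlib
could take internally.

```ocaml
(* Witness type: the constructor tells the function which result type
   the caller wants. *)
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Extract a sub-range of [b] either as an immutable string or as fresh
   mutable bytes, depending on [kind]. Only one copy is made in each case. *)
let sub_as : type a. bytes -> int -> int -> a is_a_string -> a =
  fun b ofs len kind ->
    match kind with
    | String -> Bytes.sub_string b ofs len
    | Bytes -> Bytes.sub b ofs len
```

Here sub_as buf 0 5 String has type string and sub_as buf 0 5 Bytes has
type bytes; the constructors and the stdlib modules of the same name live
in different namespaces, so they do not clash.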

>   But one could also imagine that some function behave 
> differently according to the actual type of result.  For instance, a 
> function which is likely to produce often the same strings could decide 
> to keep a (weak) table of already returned strings, or to have a 
> hard-coded list of common strings; this works only for immutable 
> results, and so the function needs to check the is_a_string constructor 
> to enable/disable these optimizations.  The "stringlike" idea could also 
> be replaced by this is_a_string GADT, so that there could be a single 
> function:
> 
>   val sub: 'a is_a_string -> 'a -> int -> int -> 'b is_a_string -> 'b
> 
> 
> All that said, I think the current situation is already a net 
> improvement over the previous one, 

Well, I wouldn't say so because I'm missing good migration paths for
some important cases.

> and that further layers can be built 
> on top of it, if needed (and not necessarily in stdlib).

Well, as pointed out, I'd really like to see one such layer in stdlib,
because we'll otherwise have five different solutions in the library
scene which are all incompatible with each other. (Your type class
suggestion looks easy and will already solve most of the issues; why not
just include it in the stdlib? It wouldn't need much: a new module
Stringlike defining it, the records for String and Bytes and maybe char
Bigarrays, and some extensions here and there where it is used, e.g. in
Lexing.) IMHO, it is important to really provide practical solutions,
and not only to theoretically have one.

Gerd



> 
> Alain
> 
> 
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-05 11:24   ` Gerd Stolpmann
@ 2014-07-08 13:23     ` Jacques Garrigue
  2014-07-08 13:37       ` Alain Frisch
  0 siblings, 1 reply; 29+ messages in thread
From: Jacques Garrigue @ 2014-07-08 13:23 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: OCaML List Mailing

On 2014/07/05 20:24, Gerd Stolpmann wrote:
> 
> Am Freitag, den 04.07.2014, 17:01 -0400 schrieb Markus Mottl:
>> I agree that the new concept has some noteworthy downsides as
>> demonstrated in the Lexing-example.  Your proposed solution 2
>> (stringlike) would probably solve these issues from a safety point of
>> view.  The downside is that the complexity of string-handling would
>> increase even more, because then we would have three types to deal
>> with.  I personally prefer safety over convenience, but other people's
>> (especially beginner's) mileage may vary.
> 
> Well, the complexity can be reduced a bit by using phantom types:
> 
> type string = [`String] stringlike
> type bytes = [`Bytes] stringlike
> 
> and then just define function-by-function what is permitted:
> 
> val get : 'a stringlike -> int -> char
> val set : [`Bytes] stringlike -> int -> char -> unit
> val sub : 'a stringlike -> int -> int -> [`String] stringlike
> val sub_bytes : 'a stringlike -> int -> int -> [`Bytes] stringlike
> 
> etc., and the modules String and Bytes would just contain aliases of
> these functions with monomorphed typing.
> 
> I don't know, though, whether we can be safe to never see the
> polymorphic typing when just using string and bytes. It would be a bit
> surprising for beginners to see that, and you sometimes would have to
> deal with unresolved type variables.

Indeed. Originally the plan was to use the above scheme for strings,
and use polymorphism to allow more flexibility. However, this is not
100% compatible, even if we allow the parameters to be ignored, because
of these unresolved type variables. This also becomes complicated
when you want to take functions as parameters.

The stringlike type itself is a good idea.
In the standard library, it could be implemented as:
   type string = private stringlike
   type bytes = private stringlike
However, it is only about allowing string and bytes arguments to be
passed to functions in a homogeneous way.
For the return case, the situation is more confused, because returning
a stringlike is actually weaker than either bytes or string.
Alain's idea of using an extra type-only parameter ('a is_a_string) works,
and it doesn't really need to be a GADT.
But it is a bit strange to use an extra parameter where a phantom type
on string itself would solve the problem. I.e., using your above approach
one can be safe just writing:
  val copy : 'a stringlike -> 'b stringlike
  val sub : 'a stringlike -> int -> int -> 'b stringlike
(assuming that we are always copying in sub too)

One could try to mix the two approaches: i.e. have a type 'a stringlike,
with explicit coercions to and from bytes and string.
Note that you can do that yourself: create your own Stringlike module,
with the coercions
   type 'a stringlike
   external from_string : string -> [> `String] stringlike = "%identity"
   external to_string : [`String] stringlike -> string = "%identity"
   …
Note that you should not write "type +'a stringlike", since you want to exploit
the fact that any stringlike must be monomorphic.
This could of course be added to the standard library, but for compatibility
reasons I think that string itself has to stay as an abstract (or private) type with
no parameter. And the above kind of coercion is compiled away, so if your
goal is performance this should not be a problem.
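
A minimal sketch of such a user-level module, under stated assumptions:
instead of the "%identity" externals it represents 'a t as bytes and uses
Bytes.unsafe_of_string / Bytes.unsafe_to_string (both exist in the Bytes
module) as the no-copy coercions. The module layout and the names
of_string, of_bytes, etc. are hypothetical. The sketch relies on the
phantom parameter to keep "set" away from values that came from a string.

```ocaml
module Stringlike : sig
  type 'a t                       (* no variance annotation: invariant *)
  val of_string : string -> [> `String ] t
  val to_string : [ `String ] t -> string
  val of_bytes : bytes -> [> `Bytes ] t
  val to_bytes : [ `Bytes ] t -> bytes
  val length : 'a t -> int
  val get : 'a t -> int -> char
  val set : [ `Bytes ] t -> int -> char -> unit
end = struct
  (* Assumption of this sketch: both kinds share the bytes representation. *)
  type 'a t = bytes
  (* No-copy coercions; sound only because the signature above never lets
     a value obtained from a string reach [set] or [to_bytes]. *)
  let of_string = Bytes.unsafe_of_string
  let to_string = Bytes.unsafe_to_string
  let of_bytes b = b
  let to_bytes b = b
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
end
```

With this signature, Stringlike.set (Stringlike.of_string "abc") 0 'x' is
rejected by the type checker, while get and length accept both kinds.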

Jacques

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 13:23     ` Jacques Garrigue
@ 2014-07-08 13:37       ` Alain Frisch
  2014-07-08 14:04         ` Jacques Garrigue
  0 siblings, 1 reply; 29+ messages in thread
From: Alain Frisch @ 2014-07-08 13:37 UTC (permalink / raw)
  To: Jacques Garrigue, Gerd Stolpmann; +Cc: OCaML List Mailing

On 07/08/2014 03:23 PM, Jacques Garrigue wrote:
> Alain's idea of using an extra type-only parameter ('a is_a_string) works,
> and it doesn't really need to be a GADT.
> But this is a bit strange to use an extra parameter where a phantom type
> on string itself would solve the problem.

I mentioned that some functions could behave differently according to 
the requested result type.  For instance, a function

  val of_bool: 'a is_a_string -> bool -> 'a

would return string literals when 'a = String and it would copy them 
when 'a = Bytes.  Similarly, a function could memoize some strings it 
produces in order to return them later again, but only when 'a = String, 
not 'a = Bytes.

Even for functions such as "copy" or "sub", it makes sense to avoid a 
copy in some cases (when both the input and the output are immutable, 
and for sub, when the range covers the entire input).

So I don't think that "'a is_a_string" can really be only a phantom type.
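
A minimal sketch of the of_bool example, assuming the is_a_string GADT
from earlier in the thread (it is repeated here so that the block stands
alone). The point is that the witness is inspected at run time, which a
pure phantom type could not offer.

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* Return the literal itself when an immutable string is requested, and
   a fresh copy when the caller asks for mutable bytes. *)
let of_bool : type a. a is_a_string -> bool -> a =
  fun kind b ->
    let lit = if b then "true" else "false" in
    match kind with
    | String -> lit                  (* no allocation needed *)
    | Bytes -> Bytes.of_string lit   (* copy: the caller may mutate it *)
```

Two calls of_bool Bytes true yield distinct buffers, so mutating one
result never affects the other or the literal.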

-- Alain

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 13:37       ` Alain Frisch
@ 2014-07-08 14:04         ` Jacques Garrigue
  0 siblings, 0 replies; 29+ messages in thread
From: Jacques Garrigue @ 2014-07-08 14:04 UTC (permalink / raw)
  To: Alain Frisch; +Cc: Mailing List OCaml, Gerd Stolpmann

[-- Attachment #1: Type: text/plain, Size: 1491 bytes --]

2014/07/08 22:38 "Alain Frisch" <alain@frisch.fr>:
>
> On 07/08/2014 03:23 PM, Jacques Garrigue wrote:
>>
>> Alain's idea of using an extra type-only parameter ('a is_a_string) works,
>> and it doesn't really need to be a GADT.
>> But this is a bit strange to use an extra parameter where a phantom type
>> on string itself would solve the problem.
>
>
> I mentioned that some functions could behave differently according to the
> requested result type.  For instance, a function
>
>  val of_bool: 'a is_a_string -> bool -> 'a
>
> would return string literals when 'a = String and it would copy them when
> 'a = Bytes.  Similarly, a function could memoize some strings it produces
> in order to return them later again, but only when 'a = String, not 'a =
> Bytes.

I see. But in that case we could also have different functions, since the
semantics change (at least for physical equality)

> Even for functions such as "copy" or "sub", it makes sense to avoid a
> copy in some cases (when both the input and the output are immutable, and
> for sub, when the range covers the entire input).

Ok, but in that case you will need a flag for both input and output
strings, since there is no way to recover this information from the string
itself.
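
A minimal sketch of "sub" with a flag on both sides, following the
signature Alain gave earlier in the thread; the body is an assumption for
illustration, not stdlib code. It shows the one case where the copy can be
skipped: immutable input, immutable output, full range.

```ocaml
type _ is_a_string =
  | String : string is_a_string
  | Bytes : bytes is_a_string

(* [src] describes the input type, [dst] the requested output type. *)
let sub : type a b. a is_a_string -> a -> int -> int -> b is_a_string -> b =
  fun src s ofs len dst ->
    match src, dst with
    | String, String ->
        (* Both sides immutable: sharing is safe when the whole input
           is requested. *)
        if ofs = 0 && len = String.length s then s
        else String.sub s ofs len
    | String, Bytes ->
        let b = Bytes.create len in
        Bytes.blit_string s ofs b 0 len;
        b
    | Bytes, String -> Bytes.sub_string s ofs len
    | Bytes, Bytes -> Bytes.sub s ofs len
```

As noted above, the sharing case changes the result of physical equality
(==) on the returned value, so it is an observable difference in
semantics, not only an optimization.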

> So I don't think that "'a is_a_string" can really be only a phantom type.

I see.
I think that both approaches have interesting applications.
But from a type system point of view they are clearly advanced.

Jacques

[-- Attachment #2: Type: text/html, Size: 1850 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
                   ` (2 preceding siblings ...)
  2014-07-07 12:42 ` Alain Frisch
@ 2014-07-08 18:15 ` mattiasw
  2014-07-08 19:24   ` Daniel Bünzli
                     ` (2 more replies)
  2014-07-21 15:06 ` Alain Frisch
  4 siblings, 3 replies; 29+ messages in thread
From: mattiasw @ 2014-07-08 18:15 UTC (permalink / raw)
  To: caml-list

My two cents:

To me it seems very strange to introduce a new string type and not make it
UTF-8 from the start.

OCaml will be the last language that doesn't have standardized Unicode
support. Even old languages like Erlang have gone the UTF-8 way, and that
includes program code.

Bytes and strings have nothing in common, but str.[4] is still relevant for
UTF-8 strings. The algorithm is just slightly more complicated.

I converted a big OCaml program to F#, and the immutable strings were the
smallest problem, since they were detected by the compiler.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 18:15 ` mattiasw
@ 2014-07-08 19:24   ` Daniel Bünzli
  2014-07-08 19:27     ` Raoul Duke
  2014-07-09 14:15   ` Daniel Bünzli
  2014-07-14 17:45   ` Richard W.M. Jones
  2 siblings, 1 reply; 29+ messages in thread
From: Daniel Bünzli @ 2014-07-08 19:24 UTC (permalink / raw)
  To: mattiasw; +Cc: caml-list

Le mardi, 8 juillet 2014 à 19:15, mattiasw@gmail.com a écrit :
> My two cents:
>  
> To me it seems very strange to introduce a new string type and not make it
> UTF-8 from start.

No new string type was introduced. A bytes type was introduced.  
  
> ocaml will be that last language that doesn't have standardize unicode
> support.  

What do you mean by standardized Unicode support in the language *exactly*?

I'd be genuinely interested in knowing the actual level of support for Unicode in these languages, beyond saying that a string is a UTF-X encoded sequence of scalar values. For example, do these other languages perform Unicode normalisation on string literals/patterns (and identifiers, if they choose that craze)? This would be absolutely necessary for performing any kind of real-world processing on Unicode strings, but then there is not only a single normalisation form, and the one you want depends on the context. Do they have a notation to indicate in which form they want the literal/pattern to be?

> Even old languages like Erlang has gone the UTF-8 way, and that
> includes program code.

For a very very very very long time it has been possible to write UTF-8 encoded literals in your OCaml sources, unnormalized or normalized according to the normal form your editor produces; you just had to drop the idea of using latin1 identifiers, which have anyway been deprecated since 4.01.

As for being able to write Unicode *identifiers* in the language, I'm actually quite glad OCaml doesn't have that: there are both too many arrow characters to use in Unicode and too many unreasonable programmers out there.
  
> Bytes and strings have nothing in common, but str.[4] is still relevant for
> UTF-8 strings.  

Direct indexing is rarely relevant in Unicode, as usually you want those indexes to correspond to user-perceived characters (e.g. to align things in text formatting), and a user-perceived character may be written as a sequence of Unicode scalar values… or not (even in normal forms, since an arbitrary number of combining characters can be applied to a base character). The Unicode segmentation algorithm allows you to find these boundaries; simple indexing doesn't, and is mostly worthless in Unicode processing.
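
A minimal sketch of why indexing is slippery, using only the stdlib. The
helper utf8_scalar_count is a hypothetical name and assumes its input is
valid UTF-8; it counts scalar values by skipping continuation bytes. It
does not perform segmentation, which needs Unicode data tables.

```ocaml
(* Count Unicode scalar values in a string assumed to be valid UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts one. *)
let utf8_scalar_count (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* U+00E9, precomposed e-acute: 2 bytes, 1 scalar value. *)
let precomposed = "\xC3\xA9"

(* 'e' followed by U+0301 COMBINING ACUTE ACCENT: 3 bytes, 2 scalar
   values, yet still a single user-perceived character. *)
let decomposed = "e\xCC\x81"
```

Both strings display as the same single character, but neither the byte
index nor the scalar-value index corresponds to what the user perceives
in the decomposed case.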

Best,

Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 19:24   ` Daniel Bünzli
@ 2014-07-08 19:27     ` Raoul Duke
  0 siblings, 0 replies; 29+ messages in thread
From: Raoul Duke @ 2014-07-08 19:27 UTC (permalink / raw)
  To: OCaml

ja wohl, n'est pas, это жизнь, based on my experiences with strings
and stuff over the years, i resonate with what Daniel posted. :-)
things like UTF-whatever are baseline requirements, but beyond that
(a) nobody has it right (b) unicode sucks. :-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 12:24   ` Gerd Stolpmann
@ 2014-07-09 13:54     ` Alain Frisch
  2014-07-09 18:04       ` Gerd Stolpmann
  2014-07-14 17:40       ` Richard W.M. Jones
  0 siblings, 2 replies; 29+ messages in thread
From: Alain Frisch @ 2014-07-09 13:54 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

On 07/08/2014 02:24 PM, Gerd Stolpmann wrote:
> It will create confusion even with actively maintained code bases. What
> could help here is very clear communication when the change will be the
> standard behavior, and how the migration will take place.

It's a very different kind of criticism from your initial point about
the decision of going into the current direction.  Point taken: the
development team will need to communicate about the expected timeline
and migration path.  But note that 4.02 is not even out, and since the
default behavior is the previous one, there is no hurry, and it's fine
if people wait a few months before trying the new mode.   It doesn't
seem crazy to wait for some early user feedback and synchronize with
them before deciding on a more precise plan for the wider community. For
instance, your feedback about porting ocamlnet is quite useful and the
current discussion shows that several solutions compete and need further
thought.  Without the new compiler switch, this discussion would not
have taken place.

> Right, that's the good side of it. (Although the danger is quite
> theoretical, as most programmers seem to intuitively follow the rule
> "don't mutate strings you did not create". I've never seen this kind of
> bug in practice.)

Still, library functions such as string_of_bool, or string_of_format (in 
the previous version) had to be written carefully, with extra copies, to 
avoid public humiliation (or not).

> I think it would be quite important to have that in the stdlib:
>
>   - This sets a standard for interoperability between libraries
>   - The stdlib can exploit the details of the representation
>   - It would be possible to use stringlike directly in C interfaces

Note that if it goes to stdlib, one cannot refer to bigarrays.  (One 
might want to have bigarrays in stdlib, but we are not there yet.)

-- Alain

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 18:15 ` mattiasw
  2014-07-08 19:24   ` Daniel Bünzli
@ 2014-07-09 14:15   ` Daniel Bünzli
  2014-07-14 17:45   ` Richard W.M. Jones
  2 siblings, 0 replies; 29+ messages in thread
From: Daniel Bünzli @ 2014-07-09 14:15 UTC (permalink / raw)
  To: caml-list

Le mardi, 8 juillet 2014 à 19:15, mattiasw@gmail.com a écrit :
> ocaml will be that last language that doesn't have standardize unicode
> support. Even old languages like Erlang has gone the UTF-8 way, and that
> includes program code.

For fun, I just had a look at what Python does.

So in Python basically they have a Unicode string which is a string made of Unicode *code points*. Fail, end of discussion. Should have been: *scalar values* (for those who don't understand why, I suggest reading my minimal Unicode introduction [1]).

(This holds both in 2 and 3; apparently 2 used to be messier for reasons I didn't bother to understand. They seem to be highly confused.)

Sample code. U+D800 is the first surrogate, i.e. something you should never see in concrete Unicode textual processing, only in UTF-16 encoded bytes and paired with an appropriate low surrogate.

Python2:

>>> u'\uD800'.encode('utf-8')
'\xed\xa0\x80'

Congratulations, you just produced an invalid UTF-8 sequence (serialized a surrogate).  

Python3 is a *little* better with *UTF-8* (but wait…) encoding stuff

>>> "\uD800".encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed


So now let's try UTF-16:

>>> "\uD800".encode("utf-16")
b'\xff\xfe\x00\xd8'


Congratulations, you just produced an invalid UTF-16 sequence: a high surrogate without a corresponding low surrogate (which together would define a Unicode scalar value).

Why on earth do they allow surrogates to be represented *at all* in their Unicode text data structure? Basically they don't understand Unicode.

The old camel should not be ashamed of its *outstanding* (absolutely) Unicode support. This is not to say that nothing can be improved (I do have some proposals in the works), but the situation is not bad either.

Best,

Daniel

P.S. Skimming through these articles about Python Unicode strings, I gather why people find Unicode hard: there seems to be a high level of both technical and conceptual confusion. Again, have a read of [1] if you'd like to clear (I hope) your mind about these things.


[1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-09 13:54     ` Alain Frisch
@ 2014-07-09 18:04       ` Gerd Stolpmann
  2014-07-10  6:41         ` Nicolas Boulay
  2014-07-14 17:40       ` Richard W.M. Jones
  1 sibling, 1 reply; 29+ messages in thread
From: Gerd Stolpmann @ 2014-07-09 18:04 UTC (permalink / raw)
  To: Alain Frisch; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 3082 bytes --]

Am Mittwoch, den 09.07.2014, 15:54 +0200 schrieb Alain Frisch:
> On 07/08/2014 02:24 PM, Gerd Stolpmann wrote:
> > It will create confusion even with actively maintained code bases. What
> > could help here is very clear communication when the change will be the
> > standard behavior, and how the migration will take place.
> 
> It's a very different kind of criticism from your initial point about 
> the decision of going into the current direction.

Right, but the question of what the user process will look like is just
the next one. The design of the change so far is minimalistic, and it is
obvious that some abstraction is missing, and my only explanation is
that there wasn't a consensus in the OCaml team (but that's just a wild
guess). I don't want to say that the OCaml team is ignoring any
problems, but it looks like the missing abstraction is somehow offloaded
to the users, namely whether it is needed at all in the stdlib (maybe
nobody is complaining), or which style is preferred. (I just want to say
that there is IMHO a connection between the minimalistic design and the
social embedding.)

>   Point taken: the 
> development team will need to communicate about the expected timeline 
> and migrate path.  But note that 4.02 is not even out, and since the 
> default behavior is the previous one, there is no hurry, and it's fine 
> if people wait a few months before trying the new mode.

My thinking here is that 95% of the users will have no problems at all
when they convert their programs. It's the other 5% for which the
current design is not really sufficient. Let's just hope these users
aren't immediately discouraged when they find it out.

>    It doesn't 
> seem crazy to wait for some early user feedback and synchronize with 
> them before deciding on a more precise plan for the wider community. For 
> instance, you feedback about porting ocamlnet is quite useful and the 
> current discussion shows that several solutions compete and need further 
> thought.  Without the new compiler switch, this discussion would not 
> have taken place.

Fully agreed.

> > I think it would be quite important to have that in the stdlib:
> >
> >   - This sets a standard for interoperability between libraries
> >   - The stdlib can exploit the details of the representation
> >   - It would be possible to use stringlike directly in C interfaces
> 
> Note that if it goes to stdlib, one cannot refer to bigarrays.  (One 
> might want to have bigarrays in stdlib, but we are not there yet.)

Right, but this isn't a big deal. (Bigarray also uses Unix.file_descr,
but this dep is easy to work around by anchoring file_descr in
Pervasives.)


Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    gerd@gerd-stolpmann.de
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-09 18:04       ` Gerd Stolpmann
@ 2014-07-10  6:41         ` Nicolas Boulay
  0 siblings, 0 replies; 29+ messages in thread
From: Nicolas Boulay @ 2014-07-10  6:41 UTC (permalink / raw)
  Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 3618 bytes --]

In one of my programs, I parse lots of small files. Most of the time was
consumed by the GC, because a lot of strings were created. Files are read
into strings and then converted to a data structure, but those string
buffers are not easily reusable for the next file. So the GC has a lot of
work to do.

Maybe OCaml needs a way to enable the reuse of strings, to reduce the
pressure on the GC and reduce the need for mutable strings.

Regards,
Nicolas



[-- Attachment #2: Type: text/html, Size: 4686 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-09 13:54     ` Alain Frisch
  2014-07-09 18:04       ` Gerd Stolpmann
@ 2014-07-14 17:40       ` Richard W.M. Jones
  1 sibling, 0 replies; 29+ messages in thread
From: Richard W.M. Jones @ 2014-07-14 17:40 UTC (permalink / raw)
  To: Alain Frisch; +Cc: Gerd Stolpmann, caml-list

On Wed, Jul 09, 2014 at 03:54:57PM +0200, Alain Frisch wrote:
> On 07/08/2014 02:24 PM, Gerd Stolpmann wrote:
> >It will create confusion even with actively maintained code bases. What
> >could help here is very clear communication when the change will be the
> >standard behavior, and how the migration will take place.
> 
> It's a very different kind of criticism from your initial point
> about the decision of going into the current direction.  Point
> taken: the development team will need to communicate about the
> expected timeline and migrate path.  But note that 4.02 is not even
> out, and since the default behavior is the previous one, there is no
> hurry, and it's fine if people wait a few months before trying the
> new mode.   It doesn't seem crazy to wait for some early user
> feedback and synchronize with them before deciding on a more precise
> plan for the wider community. For instance, you feedback about
> porting ocamlnet is quite useful and the current discussion shows
> that several solutions compete and need further thought.  Without
> the new compiler switch, this discussion would not have taken place.

The problem we may* have is that we have to support OCaml back to ~
3.10 from the same code base.

Rich.

* I say `may' in that sentence because I've just ignored the warnings
so far -- having much bigger problems with armv7hl & aarch64 support
in 4.02 right now.

-- 
Richard Jones
Red Hat

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-08 18:15 ` mattiasw
  2014-07-08 19:24   ` Daniel Bünzli
  2014-07-09 14:15   ` Daniel Bünzli
@ 2014-07-14 17:45   ` Richard W.M. Jones
  2 siblings, 0 replies; 29+ messages in thread
From: Richard W.M. Jones @ 2014-07-14 17:45 UTC (permalink / raw)
  To: caml-list

On Tue, Jul 08, 2014 at 08:15:39PM +0200, mattiasw@gmail.com wrote:
> My two cents:
> 
> To me it seems very strange to introduce a new string type and not make it
> UTF-8 from start.

I would far prefer that OCaml did *not* specify an encoding for
strings, and just left them as effectively arrays of bytes, as now.  This
leaves the business of encoding UTF-8 up to higher layers, either
camomile, iconv or the database.

That would imply removing incorrect functions like String.uppercase
and String.lowercase.

There are a couple of reasons for this:

(1) It's easy to get Unicode wrong, and baking incorrect Unicode into
the language could be worse than not having it at all.  See also:
Java, Python 2, Ruby, everything using the Win32 API.

(2) Doing it right is incredibly complex.  See also: Perl 5.

Rich.

-- 
Richard Jones
Red Hat

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-05 11:04   ` Gerd Stolpmann
@ 2014-07-16 11:38     ` Damien Doligez
  0 siblings, 0 replies; 29+ messages in thread
From: Damien Doligez @ 2014-07-16 11:38 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 3117 bytes --]

Hi Gerd and OCaml users,


First note that we are not breaking backward compatibility: you can
always use the -unsafe-string flag to compile your dusty code.


On 2014-07-05, at 13:04, Gerd Stolpmann wrote:

> So my scenario is quite low-level: I/O, and C interfaces.

As you said, bigarrays are the best suited for that kind of code.
But that's not a good reason to make all strings as heavy as bigarrays.
If you need bigarrays, by all means use bigarrays in your code, not
String or Bytes.


On 2014-07-05, at 13:24, Gerd Stolpmann wrote:

> Well, the complexity can be reduced a bit by using phantom types:
> 
> type string = [`String] stringlike
> type bytes = [`Bytes] stringlike
> 
> and then just define function-by-function what is permitted:

This is almost the same as our first version, which we discarded as
too complex and not compatible enough (as you noted, because of
unresolved type variables). But it might make a come-back.
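
For readers who have not seen the phantom-type idea spelled out, here is a
minimal sketch. It is not the version the development team discarded, and
every name in it (Stringlike, str, byt, freeze) is hypothetical. Both views
share one representation, and the signature decides function by function
what is permitted:

```ocaml
(* Hypothetical sketch: one representation, two phantom tags. *)
module Stringlike : sig
  type 'a t
  type str = [ `String ] t               (* immutable view *)
  type byt = [ `Bytes ] t                (* mutable view *)

  val of_string : string -> str
  val create : int -> byt
  val length : 'a t -> int               (* permitted on both views *)
  val get : 'a t -> int -> char          (* permitted on both views *)
  val set : byt -> int -> char -> unit   (* permitted on bytes only *)
  val freeze : byt -> str                (* copies, so the result never changes *)
end = struct
  type 'a t = Bytes.t
  type str = [ `String ] t
  type byt = [ `Bytes ] t

  let of_string = Bytes.of_string
  let create n = Bytes.make n '\000'
  let length = Bytes.length
  let get = Bytes.get
  let set = Bytes.set
  let freeze = Bytes.copy
end
```

Calling Stringlike.set on a str value is rejected by the type checker,
while length and get stay polymorphic in the tag. That polymorphism is
where the unresolved type variables mentioned above come from.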


On 2014-07-08, at 14:24, Gerd Stolpmann wrote:

> It will create confusion even with actively maintained code bases. What
> could help here is very clear communication when the change will be the
> standard behavior, and how the migration will take place. Currently, it
> feels like a big experiment - hey, let's users tentatively enable it,
> and watch out for problems.

OK, we need to be clearer on the "how" (in a nutshell, the default will
switch from -unsafe-string to -safe-string at some point in the future
when we feel that enough of the existing code has been updated).
As for the "when", we can't tell because that depends a lot on how fast
the community updates its code. Hopefully no more than three years.
Possibly as soon as 4.03.0.

> There could also be a section in
> the manual explaining the new behavior, and how to convert code.

That's a good idea.
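
As an illustration of what such a conversion might look like (a sketch
only, not text from the manual; the function name make_abc is made up):
code that used to create a string and fill it in place can build a Bytes
value instead and convert it once at the end.

```ocaml
(* Old style, which only compiles with -unsafe-string:
     let s = String.create 3 in
     s.[0] <- 'a'; s.[1] <- 'b'; s.[2] <- 'c'; s
   Converted style, which also compiles with -safe-string: *)
let make_abc () : string =
  let b = Bytes.create 3 in
  Bytes.set b 0 'a';
  Bytes.set b 1 'b';
  Bytes.set b 2 'c';
  (* Bytes.to_string copies, so later writes to b cannot affect the result. *)
  Bytes.to_string b
```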

> Right, that's the good side of it. (Although the danger is quite
> theoretical, as most programmers seem to intuitively follow the rule
> "don't mutate strings you did not create". I've never seen this kind of
> bug in practice.)

What about programmers who deliberately trigger the bug (aka "attackers",
in a security setting)? It's not just about how unlikely a bug is, but
also whether it can be exploited.

> For instance, there is one module in OCamlnet where a regexp is directly
> run on an I/O buffer (generally, you need to do some primitive parsing
> on I/O buffers before you can extract strings, and that's where
> stringlike would be nice to have). Without stringlike, I would have to
> replace that regexp somehow.

If stringlike is polymorphic, you will need a new regexp library that
operates on stringlike. We cannot update the current regexp library to
use stringlike because that would introduce polymorphism and unresolved
type variables, and that might break some of the code that used to run
on 1.03...


On 2014-07-14, at 19:45, Richard W.M. Jones wrote:

> That would imply removing incorrect functions like String.uppercase
> and String.lowercase.

First, we mark them deprecated. Then we wait a very long time before we
actually remove them (if ever).
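
A sketch of how the "mark them deprecated" step can look in user code,
using the ocaml.deprecated attribute on a signature item. The module name
My_string is hypothetical, and String.uppercase_ascii only exists in
compilers newer than the one discussed in this thread (4.03 and later):

```ocaml
module My_string : sig
  val uppercase : string -> string
    [@@ocaml.deprecated "Use uppercase_ascii instead."]

  val uppercase_ascii : string -> string
end = struct
  (* Only touches the ASCII letters a-z; all other bytes are left alone,
     so UTF-8 encoded text is not corrupted. *)
  let uppercase_ascii = String.uppercase_ascii
  let uppercase = uppercase_ascii
end
```

Callers of My_string.uppercase still compile but get a deprecation
warning, which gives them time to migrate before any removal.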

-- Damien


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 630 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
                   ` (3 preceding siblings ...)
  2014-07-08 18:15 ` mattiasw
@ 2014-07-21 15:06 ` Alain Frisch
       [not found]   ` <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be>
  4 siblings, 1 reply; 29+ messages in thread
From: Alain Frisch @ 2014-07-21 15:06 UTC (permalink / raw)
  To: Gerd Stolpmann, caml-list

On 07/04/2014 09:18 PM, Gerd Stolpmann wrote:
> http://blog.camlcity.org/blog/bytes1.html

Coming back to the motivating example of this post.

Lexing provides:

val from_channel : in_channel -> lexbuf
val from_string : string -> lexbuf
val from_function : (bytes -> int -> int) -> lexbuf

In particular, from_function expects you to write to a buffer, so it's 
pretty clear that its callback must accept a "bytes", not a "string". 
There is no place for a (string -> int -> int) -> lexbuf function.

Concerning from_string: this function copies the string to an internal 
buffer.  This is purely implemented on the OCaml side without any unsafe 
features.  We could avoid this copy because we know that the generated 
lexers won't actually modify the buffer in that case, but it would be 
very difficult to do this without using an unsafe feature, even if we 
had some sort of generalization of bytes and string.  We would instead 
need a completely different implementation (which would not use 
"stringable") to make the "source" (string or "stream") explicit in the 
lexbuf data structure.

We could also provide an extra from_bytes function, but it can currently 
be implemented by composing Bytes.to_string and Lexing.from_string.  Are 
you concerned only by the performance overhead of this approach (two 
copies)?  If so, the same argument would apply to the current 
implementation of from_string, and we would need to switch to a 
different approach, for which it's not clear that "stringable" would be 
a big help (see above).  Before doing anything like that, it would be 
interesting to evaluate the exact overhead.  It could very well be 
negligible/acceptable for most cases compared to the cost of actual lexing.
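
The composition mentioned above can be sketched as follows. The name
lexbuf_of_bytes is hypothetical (it is not part of Lexing); the two
library calls are the ones named in the text.

```ocaml
(* First copy: Bytes.to_string. Second copy: Lexing.from_string copies
   its argument into the lexbuf's internal buffer. *)
let lexbuf_of_bytes (b : Bytes.t) : Lexing.lexbuf =
  Lexing.from_string (Bytes.to_string b)
```

Because of the copies, mutating b after the call does not affect what
the lexer sees.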

-- Alain

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-04 21:01 ` Markus Mottl
  2014-07-05 11:24   ` Gerd Stolpmann
@ 2014-07-28 11:14   ` Goswin von Brederlow
  2014-07-28 15:51     ` Markus Mottl
  1 sibling, 1 reply; 29+ messages in thread
From: Goswin von Brederlow @ 2014-07-28 11:14 UTC (permalink / raw)
  To: caml-list

On Fri, Jul 04, 2014 at 05:01:18PM -0400, Markus Mottl wrote:
> I agree that the new concept has some noteworthy downsides as
> demonstrated in the Lexing-example.  Your proposed solution 2
> (stringlike) would probably solve these issues from a safety point of
> view.  The downside is that the complexity of string-handling would
> increase even more, because then we would have three types to deal
> with.  I personally prefer safety over convenience, but other people's
> (especially beginner's) mileage may vary.
> 
> The Bigarray-approach doesn't seem appealing to me.  Strings are much
> more lightweight, since they can be allocated cheaply on the
> OCaml-heap.  E.g. String.create is about 10x-100x faster than
> Bigarray.create.  That seems too big to ignore.
> 
> Regards,
> Markus

Why is that? A bigarray allocates a small block on the ocaml heap and
the buffer outside the ocaml heap. Is that normal malloc() call just
so much slower? Or are there other factors involved?

On the other hand if your app is IO heavy then you should allocate a
few buffers and reuse them. In that case the allocation overhead is
constant and the time saved for not copying in the I/O will more than
make up for it.

Or read/mmap the file into a huge bigarray and then slice it into
smaller chunks.
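
A small sketch of the "allocate once, then slice" pattern, assuming a
one-dimensional char bigarray as the buffer. The names buffer,
create_buffer and chunk are made up for this example; Array1.create and
Array1.sub are the standard Bigarray calls.

```ocaml
open Bigarray

type buffer = (char, int8_unsigned_elt, c_layout) Array1.t

(* Allocate one large buffer outside the OCaml heap and zero it. *)
let create_buffer (size : int) : buffer =
  let buf = Array1.create char c_layout size in
  Array1.fill buf '\000';
  buf

(* Array1.sub returns a view on the same memory; nothing is copied. *)
let chunk (buf : buffer) ~(pos : int) ~(len : int) : buffer =
  Array1.sub buf pos len
```

Writes through a chunk are visible in the parent buffer, so the
allocation cost is paid once however many chunks are handed out.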

> On Fri, Jul 4, 2014 at 3:18 PM, Gerd Stolpmann <info@gerd-stolpmann.de> wrote:
> > Hi list,
> >
> > I've just posted a blog article where I criticize the new concept of
> > immutable strings that will be available in OCaml 4.02 (as option):
> >
> > http://blog.camlcity.org/blog/bytes1.html
> >
> > In short my point is that it the new concept is not far reaching enough,
> > and will even have negative impact on the code quality when it is not
> > improved. I also present three ideas how to improve it.
> >
> > Gerd

You have a few more points:

1) there are 3 kinds of strings:

- string literal / constant strings [which never change ever]
- read-only strings [which YOU are not allowed to change but might change]
- mutable strings [which you are allowed to change]

There is one other thing you didn't mention here. While it is nice to
pass a mutable string to the lexer (or similar) one has to realize
that that is not thread-safe. Another thread might be mutating the
string while it is being used.

So I would suggest there is a 4th kind of string:

- frozen strings [which are mutable but won't be changed anymore]

That is basically like read-only strings but with the added promise
that they won't be changed. Nothing in the type system guarantees that;
it is just a promise from the programmer.

2) there are lots of functions that just need any kind of string and
   should accept all 3

This kind of asks for type classes. There should be a read-from-string
type class that all 3 string types would fit. Then one could have one
function accepting a read-from-string type class and all 3 string
types could be passed. But unfortunately ocaml doesn't have type
classes.

The next best thing would be enumerations (not in stdlib). Make
enumerations accept all 3 string types and then have everything else
accept enumerations. This would also mean you could pass a char list
or rope or any other type that gives you an enumeration of chars.
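
The enumeration idea can be sketched with the Seq module, which did not
exist when this was written (it appeared in the standard library in
4.07), so treat this as an illustration rather than what was proposed.
The function name count_char is made up.

```ocaml
(* Accepts any source of characters, whatever its concrete type. *)
let count_char (c : char) (chars : char Seq.t) : int =
  Seq.fold_left (fun n x -> if x = c then n + 1 else n) 0 chars

let from_string = count_char 'a' (String.to_seq "banana")
let from_bytes = count_char 'a' (Bytes.to_seq (Bytes.of_string "banana"))
let from_list = count_char 'a' (List.to_seq [ 'a'; 'b'; 'a' ])
```

The price is one closure call per character, which may matter in tight
loops.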

3) I/O code

That the stdlib uses strings for I/O and needs to copy the data around
all the time has been nagging me for years. There certainly should be
read/write functions dealing with bigarrays.

There also should be a function to create a bigarray with special
alignment (e.g. PAGESIZE) to get the best I/O performance (or, in the
case of Linux async I/O, to make it work at all).

As for mutable/immutable strings there should be a read function
returning an immutable string, which it creates internally. The string
can't be passed as argument so creating a fresh one is the only way.


Here is a completely new point:

4) What is good for strings is also good for bigarray

The same arguments concerning strings apply to bigarrays. Say you
pass a bigarray to the lexer. Can it just use it as is for its lexbuf
or does it need to copy it because it might mutate? An immutable
bigarray could be used safely as is.


And this doesn't really stop at bigarray. Even references could be
read-only, in the sense of "this might change but YOU aren't allowed
to change it". And I think the only way to solve the
const/immutable/mutable/frozen sub-types that will be applicable to
more than just string is to use phantom types.
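
A minimal sketch of such phantom-typed access rights, using polymorphic
variants as capabilities. The module name Buf and its functions are
hypothetical, and only the read-only and mutable cases are shown:

```ocaml
module Buf : sig
  type 'perm t
  val create : int -> [ `Read | `Write ] t
  val get : [> `Read ] t -> int -> char
  val set : [> `Write ] t -> int -> char -> unit
  (* Drops the write capability; the underlying storage is shared. *)
  val read_only : [> `Read ] t -> [ `Read ] t
end = struct
  type 'perm t = Bytes.t
  let create n = Bytes.make n ' '
  let get = Bytes.get
  let set = Bytes.set
  let read_only b = b
end
```

A [ `Read ] Buf.t is exactly the read-only kind from point 1: its holder
cannot call set on it, but whoever still holds the original
[ `Read | `Write ] value can, and the change shows through.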

MfG
	Goswin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-28 11:14   ` Goswin von Brederlow
@ 2014-07-28 15:51     ` Markus Mottl
  2014-07-29  2:54       ` Yaron Minsky
  0 siblings, 1 reply; 29+ messages in thread
From: Markus Mottl @ 2014-07-28 15:51 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: caml-list

On Mon, Jul 28, 2014 at 7:14 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> Why is that? A bigarray allocates a small block on the ocaml heap and
> the buffer outside the ocaml heap. Is that normal malloc() call just
> so much slower? Or are there other factors involved?

If you look at the runtime code, you'll see that there is quite a lot
going on to create a bigarray value.  Allocating small OCaml-strings
on the minor heap only costs a handful of cheap instructions, which is
obviously way more efficient.  There is some threshold at which malloc
will perform more expensive system calls to obtain memory whereas
OCaml may still be able to get some larger chunks from the major heap.
Unless Bigarrays become really large, standard OCaml strings can be
obtained much more cheaply.

> On the other hand if your app is IO heavy then you should allocate a
> few buffers and reuse them. In that case the allocation overhead is
> constant and the time saved for not copying in the I/O will more than
> make up for it.

Exactly.  Bigarrays are my buffer of choice for I/O.

> Or read/mmap the file into a huge bigarray and the slice it into
> smaller chunks.

This can improve performance for certain operations, but beware of
page faults when accessing ranges that only reside on disk.  Unless
this access is done outside of the OCaml-lock, your application could
freeze longer than allowed for realtime applications.

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-28 15:51     ` Markus Mottl
@ 2014-07-29  2:54       ` Yaron Minsky
  2014-07-29  9:46         ` Goswin von Brederlow
  2014-07-29 11:48         ` John F. Carr
  0 siblings, 2 replies; 29+ messages in thread
From: Yaron Minsky @ 2014-07-29  2:54 UTC (permalink / raw)
  To: Markus Mottl; +Cc: Goswin von Brederlow, caml-list

This isn't my idea, but it seems worth repeating: perhaps it would
make sense to have an unmovable byte-array type that had the same
memory representation as Bytes.t, with the extra guarantee that the
collector wouldn't move it.

You could imagine representing this as a private type:

module Immovable_bytes : sig
   type t = private Bytes.t
   val create : int -> t
end

with a special creation function for creating these immovable strings.
This would avoid some of the current need to write what is effectively
the same code twice, once for bigarrays and once for Bytes.t's.  In
particular, you could modify an Immovable_bytes by first up-casting it
to a Bytes.t.  But you could only actually create one by going through
the special creation function.

I'm not sure if the runtime details could be made to work out, but if
they could, I think it would be a bit nicer than the current world.
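
To show how the private type would be used from client code, here is a
sketch with a stand-in implementation. The create below is ordinary
Bytes.create and pins nothing; it is only there so that the example
compiles. The helper fill_with is made up.

```ocaml
module Immovable_bytes : sig
  type t = private Bytes.t
  val create : int -> t
end = struct
  type t = Bytes.t
  (* Stand-in only: a real version would need runtime support to make
     sure the collector never moves the block. *)
  let create n = Bytes.create n
end

(* Up-cast to Bytes.t to reuse the ordinary Bytes functions. *)
let fill_with (b : Immovable_bytes.t) (c : char) : unit =
  let raw = (b :> Bytes.t) in
  Bytes.fill raw 0 (Bytes.length raw) c
```

The coercion only goes one way: a plain Bytes.t cannot be turned into an
Immovable_bytes.t except through create.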

y

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-29  2:54       ` Yaron Minsky
@ 2014-07-29  9:46         ` Goswin von Brederlow
  2014-07-29 11:48         ` John F. Carr
  1 sibling, 0 replies; 29+ messages in thread
From: Goswin von Brederlow @ 2014-07-29  9:46 UTC (permalink / raw)
  To: caml-list

On Mon, Jul 28, 2014 at 10:54:36PM -0400, Yaron Minsky wrote:
> This isn't my idea, but it seems worth repeating: perhaps it would
> make sense to have an unmovable byte-array type that had the same
> memory representation as Bytes.t, with the extra guarantee that the
> collector wouldn't move it.
> 
> You could imagine representing this as a private type:
> 
> module Immovable_bytes : sig
>    type t = private Bytes.t
>    val create : int -> t
> end
> 
> with a special creation function for creating these immovable strings.
> This would avoid some of the current need to write what is effectively
> the same code twice, once for bigarrays and once for Bytes.t's.  In
> particular, you could modify an Immovable_bytes by first up-casting it
> to a Bytes.t.  But you could only actually create one by going through
> the special creation function.

Include Bytes and then redefine all the functions creating new Bytes
to use the special create. That way you can use all the Bytes
functions without upcasting.

> I'm not sure if the runtime details could be made to work out, but if
> they could, I think it would be a bit nicer than the current world.
> 
> y

I think that is simple enough to do. Simply have your create call
malloc and set the right header for the block and return the address +
header_size. Also add a Gc.finalise to free the memory when the block
becomes unreachable.

Or can you only finalise blocks inside the ocaml heap?


A slight problem though is that sometimes alignment is important (e.g.
Linux AIO needs block aligned data). You would have to allocate the
memory to be 4/8 bytes shy of block alignment, which means allocating
a chunk one block bigger and aligning the data within that larger
block. It also means you need to store the real address of the block
before or after the data and complicates your free function.

For page aligned data of page size that means you need to allocate (2
* PAGE_SIZE + 16) bytes of data whereas a bigarray only needs
PAGE_SIZE + the ocaml block on the major heap.



In general it could be nice to include `Immovable as phantom type next
to `Const, `Read and `Write. Then you could have immovable strings,
bytes, ...

MfG
	Goswin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
  2014-07-29  2:54       ` Yaron Minsky
  2014-07-29  9:46         ` Goswin von Brederlow
@ 2014-07-29 11:48         ` John F. Carr
  1 sibling, 0 replies; 29+ messages in thread
From: John F. Carr @ 2014-07-29 11:48 UTC (permalink / raw)
  To: Yaron Minsky; +Cc: caml-list


 > This isn't my idea, but it seems worth repeating: perhaps it would
 > make sense to have an unmovable byte-array type that had the same
 > memory representation as Bytes.t, with the extra guarantee that the
 > collector wouldn't move it.

I don't think this can be implemented without a significant runtime
change.  If your program is special enough to need pinned strings, try
turning off compaction with Gc.set.  The collector will never move
objects larger than 256 words.
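
A sketch of the Gc.set call this refers to, assuming the classic (4.x)
runtime, where the Gc documentation states that a max_overhead of
1000000 or more means compaction is never triggered. The function name
disable_compaction is made up.

```ocaml
(* Keep all other GC parameters and only raise max_overhead. *)
let disable_compaction () : unit =
  Gc.set { (Gc.get ()) with Gc.max_overhead = 1_000_000 }
```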

It is easy to allocate a string from outside the heap, so the
collector cannot move it, but then you have to do manual memory
management.  Manual memory management means explicit free or a
finalizer on an object you know will outlive any reference to the
string.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Caml-list] Immutable strings
       [not found]   ` <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be>
@ 2014-08-29 16:30     ` Damien Doligez
  0 siblings, 0 replies; 29+ messages in thread
From: Damien Doligez @ 2014-08-29 16:30 UTC (permalink / raw)
  To: OCaml Mailing List

On 2014-07-22, at 23:51, Christophe Troestler wrote:

> What about having a phantom variable on bytes indicating access?  A
> string could become a "ro bytes" without copying.

Technically, that would work. In the latest developers meeting, we
decided against the phantom type approach because its main advantage
is also its main drawback: it takes advantage of the common
representation of string and bytes.

By keeping the two types separate, we get the freedom of representing
them differently. While we have no short-term plan to do that in the
normal OCaml runtime, we expect this to be a big win for the likes of
ocamljava and js_of_ocaml.

Also, as far as we can tell (and we need user feedback at this point)
strings and byte buffers are quite distinct in normal OCaml source,
so we wouldn't win much by being able to mix them.

We are also open to feedback and suggestions on convenience functions
that could be added to string.ml to help build strings in common cases
without going through a bytes value (http://caml.inria.fr/mantis/view.php?id=6500 )
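
For illustration, a few ways to build strings without an explicit bytes
value, using functions that exist in the standard library (String.init
and String.map from 4.02, Char.uppercase_ascii from 4.03). The three
wrapper names are made up, and this is not a list of what the ticket
proposes.

```ocaml
(* The first n letters of the alphabet; assumes 0 <= n <= 26. *)
let alphabet_prefix (n : int) : string =
  String.init n (fun i -> Char.chr (Char.code 'a' + i))

(* Character-wise transformation into a fresh string. *)
let shout (s : string) : string = String.map Char.uppercase_ascii s

(* Joining pieces instead of blitting them into a buffer by hand. *)
let join_words (ws : string list) : string = String.concat " " ws
```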

-- Damien


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-08-29 16:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-04 19:18 [Caml-list] Immutable strings Gerd Stolpmann
2014-07-04 20:31 ` Anthony Tavener
2014-07-04 20:38   ` Malcolm Matalka
2014-07-04 23:44   ` Daniel Bünzli
2014-07-05 11:04   ` Gerd Stolpmann
2014-07-16 11:38     ` Damien Doligez
2014-07-04 21:01 ` Markus Mottl
2014-07-05 11:24   ` Gerd Stolpmann
2014-07-08 13:23     ` Jacques Garrigue
2014-07-08 13:37       ` Alain Frisch
2014-07-08 14:04         ` Jacques Garrigue
2014-07-28 11:14   ` Goswin von Brederlow
2014-07-28 15:51     ` Markus Mottl
2014-07-29  2:54       ` Yaron Minsky
2014-07-29  9:46         ` Goswin von Brederlow
2014-07-29 11:48         ` John F. Carr
2014-07-07 12:42 ` Alain Frisch
2014-07-08 12:24   ` Gerd Stolpmann
2014-07-09 13:54     ` Alain Frisch
2014-07-09 18:04       ` Gerd Stolpmann
2014-07-10  6:41         ` Nicolas Boulay
2014-07-14 17:40       ` Richard W.M. Jones
2014-07-08 18:15 ` mattiasw
2014-07-08 19:24   ` Daniel Bünzli
2014-07-08 19:27     ` Raoul Duke
2014-07-09 14:15   ` Daniel Bünzli
2014-07-14 17:45   ` Richard W.M. Jones
2014-07-21 15:06 ` Alain Frisch
     [not found]   ` <20140722.235104.405798419265248505.Christophe.Troestler@umons.ac.be>
2014-08-29 16:30     ` Damien Doligez
