Data structures in ocaml

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* Data structures in ocaml
@ 1999-10-06 18:50 skaller
  1999-10-07 12:10 ` Pierre Weis
  0 siblings, 1 reply; 10+ messages in thread
From: skaller @ 1999-10-06 18:50 UTC (permalink / raw)
  To: caml-list

Some comments on data structures.
Please let me know if I miss something, which is possible.

1. It should be possible to create an uninitialised array.
[It can be done for string]

2. Arrays are mutable, but not variable length.

3. It isn't clear to me how fast concatenation
of sub arrays (or substrings) is. If I write

	Array.append 
		(Array.sub x xfirst xlen) 
		(Array.sub y yfirst ylen)

it isn't clear if the intermediate subarrays are
needlessly constructed or not. 

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-06 18:50 Data structures in ocaml skaller
@ 1999-10-07 12:10 ` Pierre Weis
  1999-10-08 18:05   ` skaller
  0 siblings, 1 reply; 10+ messages in thread
From: Pierre Weis @ 1999-10-07 12:10 UTC (permalink / raw)
  To: skaller; +Cc: caml-list

> Some comments on data structures.
> Please let me know if I miss something, which is possible.
> 
> 1. It should be possible to create an uninitialised array.
> [It can be done for string]

No, it should not.

This is an easy consequence of the main theorem of Caml type-checking:
no well-typed Caml program can ever go wrong.

(Informal proof: there is no ill-typed Caml program; hence no Caml
program can manipulate a value whose type is different from the type
statically assigned to the value by the Caml type-checker; hence there
is no value with an unknown or invalid type; hence there is no
``invalid or impossible value'' in Caml (since this value will not
have a valid type); hence it is impossible to create an unitialised
array.)

This is also due to:

-- a design decision: no uninitialized entities exist in Caml. This is
consistent to the fact that in Caml you DEFINE identifiers instead of
DECLARING them.
-- an implementation necessity, since the garbage collector cannot be
faced with some unknown ill-formed data.

A way to mimick uninitialised array inside the Caml language framework
could be to systematically use option encapsulation, initialising the
array with None values, and storing Some x instead of x
afterward. This is not acceptable in practice since array accesses
would systematically need destructuring the result.

There are other possibilities that cannot be written into the
language, since they locally violates the correct memory management,
but that an implementation may provide. For instance, the run-time
system may define a special ``null'' (allocated) value that is used by
an external primitive to fill uninitialised arrays; then, the array
access primitive would test at each access if the value read is
identical (==) to ``null'' or not; if so, then the array access fails
and an error is raised, otherwise the value fetched is returned. This
means a loss of efficiency due to a spurious test for each array
access, and this is not desirable. This means also that you may have
random exceptions raised, due to the presence of random uninitialised
elements into arrays, and this is not desirable as well.

> 2. Arrays are mutable, but not variable length.

This would require an extra indirection for each vector access, which
is undesirable. Furthermore, variable length vectors can be defined
by an easy abstraction based on the basic static length vectors.

type 'a evect = ('a array) ref;;

let make n x = ref (Array.make n x);;

let give_room v n =
 let s = !v in
 let l = Array.length s in
 if n >= l then
  let new_s = Array.make n s.(0) in
  Array.blit s 0 new_s 0 l;
  v := new_s;;

let check_bound v n =
 if n < 0 || n >= Array.length !v then invalid_arg "evect";;

let get v n = check_bound v n; !v.(n);;
let set v n x = check_bound v n; !v.(n) <- x;;

> 3. It isn't clear to me how fast concatenation
> of sub arrays (or substrings) is. If I write
> 
> 	Array.append 
> 		(Array.sub x xfirst xlen) 
> 		(Array.sub y yfirst ylen)
> 
> it isn't clear if the intermediate subarrays are
> needlessly constructed or not. 

Remember a simple rule found in the FAQ: in Caml, there is no implicit
copy of data, to get a copy you must explicitely call a copying function. On
the other hand, if you call such a function you get what you ask for
and a copy is done.

Hence your question is: is there a copying function called in the body
of Array.sub ? The answer is yes, since we find in Array.sub:
 let r = create len ... in
 ...
 r

So yes, intermediate subarrays are constructed, since you explicitely
asked for their construction (however, in this case the storage they
use will be reclaimed by the next minor collection). You can imagine a
fancy optimizing compiler that can avoid their construction by
analyzing the code of Array.append (this is named ``deforestation'' in
the context of cons celles and lists append), but this analysis
remains to be invented for arrays.

Anyway, if you want to append two chunks of arrays with no
intermediate allocation, you just have to write a special purpose
function, something like:

let append_two_chunks v1 start1 end1 v2 start2 end2 =
 let result = Array.make ...
 Array.blit v1 start1 result 0 ...
 Array.blit v2 start2 result start1 ...
 result;;

(SMOP: you can also generalize the previous function to an arbitrary
list of chunks...)

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-07 12:10 ` Pierre Weis
@ 1999-10-08 18:05   ` skaller
  1999-10-09 23:17     ` Markus Mottl
  1999-10-09 23:52     ` Pierre Weis
  0 siblings, 2 replies; 10+ messages in thread
From: skaller @ 1999-10-08 18:05 UTC (permalink / raw)
  To: Pierre Weis; +Cc: caml-list

Pierre Weis wrote:
> 
> > Some comments on data structures.
> > Please let me know if I miss something, which is possible.
> >
> > 1. It should be possible to create an uninitialised array.
> > [It can be done for string]
> 
> No, it should not.

	It is _necessary_  for efficiency, to represent
procedural code cleanly, and it is possible that it can
be done in a _well typed_ manner. See my proposal for
an explicit categorical initial.

> This is an easy consequence of the main theorem of Caml type-checking:
> no well-typed Caml program can ever go wrong.

	I do not understand this (obviously simplified) assertion.
As stated, it is clearly not the case. Ocaml programs can
easily go wrong in many ways: infinite loops, core dumps
due to stack overflows, exceptions due to invalid function
arguments ....

	Perhaps you mean that the type system is sound, but that is
a relative thing: the soundness is useful only within
the context of it's expressive power, and like most
conventional type systems that is definitely limited in practice.

	There are a whole host of possible run time errors in ocaml,
which represent a lack of expressiveness of the type system.

> This is also due to:
> 
> -- a design decision: no uninitialized entities exist in Caml. This is
> consistent to the fact that in Caml you DEFINE identifiers instead of
> DECLARING them.

	I understand that this decision was made, and why, but it does not
sit so well with the procedural aspect of ocaml.

> -- an implementation necessity, since the garbage collector cannot be
> faced with some unknown ill-formed data.

	This only follows from an overly 'strict' interpretation of
'uninitialised'. A pragmatic implementation might well initialise
boxed values for the reason you state, but that need not mean
the client programmer must. The system can surely do this
more efficiently, if it must be done for internal implementation
reasons.
 
> A way to mimick uninitialised array inside the Caml language framework
> could be to systematically use option encapsulation, initialising the
> array with None values, and storing Some x instead of x
> afterward.

	Yes. So the idea is to make the compiler do that,
for all values.

> This is not acceptable in practice since array accesses
> would systematically need destructuring the result.

	Precisely.
 
> There are other possibilities that cannot be written into the
> language, since they locally violates the correct memory management,
> but that an implementation may provide. For instance, the run-time
> system may define a special ``null'' (allocated) value that is used by
> an external primitive to fill uninitialised arrays; then, the array
> access primitive would test at each access if the value read is
> identical (==) to ``null'' or not; if so, then the array access fails
> and an error is raised, otherwise the value fetched is returned. This
> means a loss of efficiency due to a spurious test for each array
> access, and this is not desirable. 

	I do not care that efficiency is lost while debugging,
and the spurious test can be turned off with the -unsafe option
just like bounds checking.

> > 2. Arrays are mutable, but not variable length.
> 
> This would require an extra indirection for each vector access, which
> is undesirable. 

	you are missing the point: I am NOT suggesting
changing the existing Array module, but considering the problem
of implementing a new module for a variable length one.
In this case the 'undesirability' of the extra indirection
is completely irrelevant: it is necessary IF you in fact
want variable length arrays, and I MUST have them
or I cannot implement Python lists efficiently (Python uses
variable length arrays for lists, using realloc to reallocate them).

>Furthermore, variable length vectors can be defined
> by an easy abstraction based on the basic static length vectors.

	No they can't. I know, I'm trying to do it.
It is not at all 'easy', but grossly obscures code clarity.
To see this, just consider that your code below is, in fact, wrong:

> type 'a evect = ('a array) ref;;
> 
> let make n x = ref (Array.make n x);;
> 
> let give_room v n =
>  let s = !v in
>  let l = Array.length s in
>  if n >= l then
>   let new_s = Array.make n s.(0) in

	What happens if Array.length s = 0??

	You see, it is NOT possible to implement the 
'give_room' function. Instead, you must provide special
code in EVERY function of the module to do the
reallocation, and you must put dummy values at the end
of the array, which will not be collected. This code
must be special purpose for each function, to find
a suitable dummy value: if there is no such value,
the array can be set to zero length: it CAN be done,
but it is a mess, it is inefficent, and it can
waste a lot of storage, particularly if,
as in my implementation, I chose to leave 'old' values
lying around in the unused part instead of resetting
them all to a single (shared) dummy value.
	
> Anyway, if you want to append two chunks of arrays with no
> intermediate allocation, you just have to write a special purpose
> function, something like:
> 
> let append_two_chunks v1 start1 end1 v2 start2 end2 =
>  let result = Array.make ...
>  Array.blit v1 start1 result 0 ...
>  Array.blit v2 start2 result start1 ...
>  result;;

	Yes, and this requires getting a dummy value to initialise
the whole array with, followed by immediately overwriting
that dummy value with the actual data, and thus is twice
as slow as C.
 
> (SMOP: you can also generalize the previous function to an arbitrary
> list of chunks...)

I have. And it is even messier to find the dummy value then;
I need a recursive function scanning the list of arrays to
be concatenated just to find the dummy value,
and if all the arguments are zero length, you have to
handle it as a special case.

----------------------

I think my point is this: when building a library module,
it is useful to have extra code to handle the magic
initial value, to find bugs. When the library is thought
to be correct, it can be recompiled with the -unsafe option.
The result is then fast, although it may crash the program
if there is still a bug -- but if there is a bug, the program
is bugged _anyhow_, and all that is lost is some extra
diagnostic information, which can sometimes be recovered
by rerunning the program on the same data, but with the
program re-built with the debugging version of the library.

The point is that not much safety is lost, since only that
module is compiled with -unsafe, and only when it is
well enough tested and analysed to have reasonable confidence
in its correctness.

The alternative is to write code three times longer
than would otherwise be required, which is MORE likely
to be bugged, even though the bug will not be from an
uninitialised value.

The tradeoff is: catching a specific kind of
error automatically compared with
missing a dynamic error which is more likely
to occur.

In the variable length array case there is no issue:
IF I fail to set an array element to the correct value,
THEN accessing it is an error. It does not matter whether
the value is the wrong value, or an uninitialised one,
except perhaps that a core dump from an uninitialised one
would be easier to trace! [Especially if the code were simpler]

In other words, the decision to force initialisation of values
even when it is inappropriate is some times a _disservice_
to the programmer.  I recommend giving the programmer a choice.
I think this is viable, since the choice can be made
on a per compilation unit basis.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-08 18:05   ` skaller
@ 1999-10-09 23:17     ` Markus Mottl
  1999-10-09 23:52     ` Pierre Weis
  1 sibling, 0 replies; 10+ messages in thread
From: Markus Mottl @ 1999-10-09 23:17 UTC (permalink / raw)
  To: skaller; +Cc: OCAML

> 	You see, it is NOT possible to implement the 
> 'give_room' function. Instead, you must provide special
> code in EVERY function of the module to do the
> reallocation, and you must put dummy values at the end
> of the array, which will not be collected. This code
> must be special purpose for each function, to find
> a suitable dummy value: if there is no such value,
> the array can be set to zero length: it CAN be done,
> but it is a mess, it is inefficent, and it can
> waste a lot of storage, particularly if,
> as in my implementation, I chose to leave 'old' values
> lying around in the unused part instead of resetting
> them all to a single (shared) dummy value.

My solution in the (soon to be released) resizable array module is as
follows (I hope I haven't overlooked any problem with this approach):

Elements of arrays which are not int- or float-arrays are always boxed.
If we do not want to waste space in the "free" part of the array, we
will have to initialize it with a special value. We could have the user
pass one to us, but he would have to create it and this can cost memory
if this value is very space-hungry and it is also not very elegant.

Thus, we would like to have a default value for all kinds of types which
does not need much space and needs not be known to the user.

If I have understood the garbage collector correctly, it does not know
anything about things going on in the type system, but it knows how
to recognize the basic types (atoms). If we take a very "small" atom
(e.g. an integer) and "cast" it to the required type (using Obj.magic),
we can cheat and make the compiler believe that we have a correct value
of the type of the array elements.

The garbage collector on the other hand does not care about the "true"
type of the array elements. Because they are boxed (otherwise the GC
wouldn't scan through the array), it will believe that there are boxed
integers in the "free" slots (more precisely, it's always the same
integer which is "contained" in the box).

With this trick we can initialize the array and (very important) have
removed elements be reclaimed when we overwrite their "box" so that it
points to the integer.

> 	Yes, and this requires getting a dummy value to initialise
> the whole array with, followed by immediately overwriting
> that dummy value with the actual data, and thus is twice
> as slow as C.

In the case of unboxed arrays (the same can be said about strings), we
need not do this "overwriting", because the values will not be reclaimed,
anyway (wouldn't make much sense). Thus, for these kinds of "contiguous
memory" it is ok to simply adjust the index to the last element.

For strings, there is a create-function that doesn't initialize the
memory.  If there were a similar function for int- and float-arrays, I
could greatly improve the speed of array creation in my module and (very
important) for reallocation if the array is being resized. Maybe I will
supply a C-function that allocates such arrays without initialization. But
I will rather focus on functionality first...

Regards,
Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-08 18:05   ` skaller
  1999-10-09 23:17     ` Markus Mottl
@ 1999-10-09 23:52     ` Pierre Weis
  1999-10-10  8:14       ` skaller
  1999-10-10 12:56       ` Michel Quercia
  1 sibling, 2 replies; 10+ messages in thread
From: Pierre Weis @ 1999-10-09 23:52 UTC (permalink / raw)
  To: skaller; +Cc: caml-list

> 	It is _necessary_  for efficiency, to represent
> procedural code cleanly, and it is possible that it can
> be done in a _well typed_ manner. See my proposal for
> an explicit categorical initial.

10 years ago, I designed and implemented for Caml the keyword
``any''. Any was an expression, with polymorphic type scheme
For all 'a. 'a.

The typing rule for ``any'' was regular except that the type variable
instance of any's type scheme must be instanciated at the end of
type-checking (since there is no Caml value of type 'a).

The semantics was that (any : t) was any value of type t.

For instance:

-- (any : int) was 0
-- (any : string) was ""
...
-- (any : t) where t was a sum type was the first constant constuctor if
any ot the first functional constructor C of ty with (any : ty) as
argument. (Hence (any : string list) was [].)
-- (any : t) where t was a record type was the list of field with the
value (any : ty) in each field (ty according to the label argument).
-- (any : t1 -> t2) was (fun _ -> raise Bottom)

This is well-typed and semantically sound.

This feature is conceivably useful to initialize variables (I mean
references) or vectors.

> > This is an easy consequence of the main theorem of Caml type-checking:
> > no well-typed Caml program can ever go wrong.
> 
> 	I do not understand this (obviously simplified) assertion.
> As stated, it is clearly not the case. Ocaml programs can
> easily go wrong in many ways: infinite loops, core dumps
> due to stack overflows, exceptions due to invalid function
> arguments ...

You're perfectly right.

> 	Perhaps you mean that the type system is sound, but that is
> a relative thing: the soundness is useful only within
> the context of it's expressive power, and like most
> conventional type systems that is definitely limited in practice.

You're perfectly right.

> 	There are a whole host of possible run time errors in ocaml,
> which represent a lack of expressiveness of the type system.

Exactly, that's our dream to have a more expresive type system that
would prevent those run time errors.

> > This is also due to:
> > 
> > -- a design decision: no uninitialized entities exist in Caml. This is
> > consistent to the fact that in Caml you DEFINE identifiers instead of
> > DECLARING them.
> 
> 	I understand that this decision was made, and why, but it does not
> sit so well with the procedural aspect of ocaml.

That's your opinion, which is interesting.

> > -- an implementation necessity, since the garbage collector cannot be
> > faced with some unknown ill-formed data.
> 
> 	This only follows from an overly 'strict' interpretation of
> 'uninitialised'. A pragmatic implementation might well initialise
> boxed values for the reason you state, but that need not mean
> the client programmer must. The system can surely do this
> more efficiently, if it must be done for internal implementation
> reasons.

The system simply initializes vectors with the value supplied by the
user. If it did it with a special implicit value, the lack of
efficiency would be the same.

> > There are other possibilities that cannot be written into the
> > language, since they locally violates the correct memory management,
> > but that an implementation may provide. For instance, the run-time
> > system may define a special ``null'' (allocated) value that is used by
> > an external primitive to fill uninitialised arrays; then, the array
> > access primitive would test at each access if the value read is
> > identical (==) to ``null'' or not; if so, then the array access fails
> > and an error is raised, otherwise the value fetched is returned. This
> > means a loss of efficiency due to a spurious test for each array
> > access, and this is not desirable. 
> 
> 	I do not care that efficiency is lost while debugging,
> and the spurious test can be turned off with the -unsafe option
> just like bounds checking.

I do not understand: the efficiency is lost anyway as soon as the
vectors have to be initialized, since this initialization is an O(n)
operation. Also if you don't perform the test at array access, you
will have a very difficult time to debug the program, since the
``null'' value could be a valid value of the type stored into the
array, and the program may crash long time after the erroneous access
or even worse continue with this erroneous value and give irrelevant
results ...

> > > 2. Arrays are mutable, but not variable length.
> > 
> > This would require an extra indirection for each vector access, which
> > is undesirable. 
> 
> 	you are missing the point: I am NOT suggesting
> changing the existing Array module, but considering the problem
> of implementing a new module for a variable length one.

You're right, I misunderstood the phrase ``2. Arrays are mutable, but not
variable length.''. I did not catch it meant ``I'm considering of
implementing a new module for a variable length one.''

> In this case the 'undesirability' of the extra indirection
> is completely irrelevant: it is necessary IF you in fact
> want variable length arrays, and I MUST have them
> or I cannot implement Python lists efficiently (Python uses
> variable length arrays for lists, using realloc to reallocate them).

You're certainly right. However you did not give the benchmark that
proves that Caml lists are way too inefficient to implement the Python
lists! Can you please give us the specification you need, that is, the
kind of primitives you want to implement efficiently, and the runtime
figures you got with the Python implementation (i.e. a benchmark) ?

> >Furthermore, variable length vectors can be defined
> > by an easy abstraction based on the basic static length vectors.
> 
> 	No they can't. I know, I'm trying to do it.
> It is not at all 'easy', but grossly obscures code clarity.
> To see this, just consider that your code below is, in fact, wrong:
> 
> > type 'a evect = ('a array) ref;;
> > 
> > let make n x = ref (Array.make n x);;
> > 
> > let give_room v n =
> >  let s = !v in
> >  let l = Array.length s in
> >  if n >= l then
> >   let new_s = Array.make n s.(0) in
> 
> 	What happens if Array.length s = 0??

Invalid array access. I implicitely supposed that the vectors were not
empty. If they can, just change the module as

type 'a evect = {init : 'a; mutable contents : 'a array};;

let make n x = {init = x; contents = Array.make n x};;

let give_room v n =
  let s = v.contents in
  let l = Array.length s in
  if n >= l then
   let new_s = Array.make n v.init in
   ...

Anyway, if you use those vectors to implement lists, the variable length
vectors you need are not empty, and you can use the original type I
posted, since you do not use these vectors in the case of empty
lists. (Your list datatype should look like:

type 'a plist =
   | Empty
   | Cons of 'a evect;;

isn't it ?)

> 	You see, it is NOT possible to implement the 
> 'give_room' function.

I don't understand.

> Instead, you must provide special
> code in EVERY function of the module to do the
> reallocation, and you must put dummy values at the end
> of the array, which will not be collected. This code
> must be special purpose for each function, to find
> a suitable dummy value: if there is no such value,
> the array can be set to zero length: it CAN be done,
> but it is a mess, it is inefficent, and it can
> waste a lot of storage, particularly if,
> as in my implementation, I chose to leave 'old' values
> lying around in the unused part instead of resetting
> them all to a single (shared) dummy value.

If you have this problem, you can use option type (as I mentioned in
my previous message) in the module implementing variable length
vectors. ``Uninitialized'' elements will be None values that are not
garbbage collected anyway. Option manipulation will be local to the
set and get functions, and completely transparent to the clients of
the module.

If you have those ``dummy'' values in vectors, you may also try weak
pointers.

Also there are zillion of extensible vectors implementations, for
instance those based on a default value (which appears very often in
the vector and is never stored in the vector at all), with hash tables
(sparse vectors), trees of vectors of a fixed length, vectors of
vectors ...

> > Anyway, if you want to append two chunks of arrays with no
> > intermediate allocation, you just have to write a special purpose
> > function, something like:
> > 
> > let append_two_chunks v1 start1 end1 v2 start2 end2 =
> >  let result = Array.make ...
> >  Array.blit v1 start1 result 0 ...
> >  Array.blit v2 start2 result start1 ...
> >  result;;
> 
> 	Yes, and this requires getting a dummy value to initialise
> the whole array with, followed by immediately overwriting
> that dummy value with the actual data, and thus is twice
> as slow as C.

If you do not want to pay for the security to be sure that all values
are always valid, I guess you must write this function directly in C ...

> > (SMOP: you can also generalize the previous function to an arbitrary
> > list of chunks...)
> 
> I have. And it is even messier to find the dummy value then;
> I need a recursive function scanning the list of arrays to
> be concatenated just to find the dummy value,
> and if all the arguments are zero length, you have to
> handle it as a special case.

Anyway you have to write such a function to compute the length of the
resulting array: I don't think it is difficult to also return an
initialisation value option with the total length of the vector...

> ----------------------
> 
> I think my point is this: when building a library module,
> it is useful to have extra code to handle the magic
> initial value, to find bugs. When the library is thought
> to be correct, it can be recompiled with the -unsafe option.
> The result is then fast, although it may crash the program
> if there is still a bug -- but if there is a bug, the program
> is bugged _anyhow_, and all that is lost is some extra
> diagnostic information, which can sometimes be recovered
> by rerunning the program on the same data, but with the
> program re-built with the debugging version of the library.

There is a philosophical problem here, since one of the fundamental
idea of Caml-like languages is just to
 1) prevents runtime errors as much as possible via static analyses.
 2) give you some meaningful diagnostic in case of error (something
different from ``bus error'')

We very often chose to jeopardise efficiency in favor of safety and good error
diagnostic. For instance, we considered several times to remove the
-unsafe option. If we had implemented a good array bound checkers, we
certainly had removed it ...

> The point is that not much safety is lost, since only that
> module is compiled with -unsafe, and only when it is
> well enough tested and analysed to have reasonable confidence
> in its correctness.

You're right. That's always what we claim to the clients of our
programs: ``it is well enough tested and analysed to have reasonable
confidence in its correctness.'' Unfortunately, we also have bug
reports ...

> The alternative is to write code three times longer
> than would otherwise be required, which is MORE likely
> to be bugged, even though the bug will not be from an
> uninitialised value.
> 
> The tradeoff is: catching a specific kind of
> error automatically compared with
> missing a dynamic error which is more likely
> to occur.
> 
> In the variable length array case there is no issue:
> IF I fail to set an array element to the correct value,
> THEN accessing it is an error. It does not matter whether
> the value is the wrong value, or an uninitialised one,
> except perhaps that a core dump from an uninitialised one
> would be easier to trace! [Especially if the code were simpler]
> 
> In other words, the decision to force initialisation of values
> even when it is inappropriate is some times a _disservice_
> to the programmer.  I recommend giving the programmer a choice.
> I think this is viable, since the choice can be made
> on a per compilation unit basis.
>
> John Skaller, mailto:skaller@maxtal.com.au
> 1/10 Toxteth Rd Glebe NSW 2037 Australia
> homepage: http://www.maxtal.com.au/~skaller
> downloads: http://www.triode.net.au/~skaller

You're right. We had often the same kind of discussions and the same
kind of arguments with people from other programming groups (C, Perl,
Scheme, Ada, ...): very often they replace ``uninitialised arrays'' by
``strongly typed'' or ``pattern matching exaustivity''; and have
exactly the same arguments ``sometime inappropriate'', ``_disservice_
to the programmer'', ``core dump easier to trace'', ``simpler code if
all those lifegards are omitted'', ``I don't need those static checks
since I careful test and analyze my code''.

Once more, it is a matter of philosophical opinions: in Caml the
language ensures some invariants. This is a strong and convenient
property, that you may like or dislike. And unfortunately this
property has a price ... 

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-09 23:52     ` Pierre Weis
@ 1999-10-10  8:14       ` skaller
  1999-10-10 12:56       ` Michel Quercia
  1 sibling, 0 replies; 10+ messages in thread
From: skaller @ 1999-10-10  8:14 UTC (permalink / raw)
  To: Pierre Weis; +Cc: caml-list

Pierre Weis wrote:

> 10 years ago, I designed and implemented for Caml the keyword
> ``any''. Any was an expression, with polymorphic type scheme
> For all 'a. 'a.
> 
> The typing rule for ``any'' was regular except that the type variable
> instance of any's type scheme must be instanciated at the end of
> type-checking (since there is no Caml value of type 'a).
> 
> The semantics was that (any : t) was any value of type t.

	A don't care value. This slightly different to 'initial',
which is more like 'not any'. :-)
 
> This feature is conceivably useful to initialize variables (I mean
> references) or vectors.

	Perhaps (but I don't see how it is always possible to
provide an 'any' instance of an object of class type?)

> >       There are a whole host of possible run time errors in ocaml,
> > which represent a lack of expressiveness of the type system.
> 
> Exactly, that's our dream to have a more expresive type system that
> would prevent those run time errors.

	I share your dream: but add to it the need to do things
efficiently, as well, and to support sharing of commonality
(that is, reuse). 
 
> > > This is also due to:
> > >
> > > -- a design decision: no uninitialized entities exist in Caml. This is
> > > consistent to the fact that in Caml you DEFINE identifiers instead of
> > > DECLARING them.
> >
> >       I understand that this decision was made, and why, but it does not
> > sit so well with the procedural aspect of ocaml.
> 
> That's your opinion, which is interesting.

	A lot of work is done in C++, concerning initialisation
of class objects. As you may know, C++ allows the data of a class
to be initialised componentwise, and then also permits assignment
to the values in the constructor body. I have conjectured that
it is always possible to write the code with an empty constructor
body, that is, entirely functionally. I have not seen a contrary
example, but there are plenty of examples where it is very
messy and inconvenient not to leave the value uninitialised,
and assign to it later in the constructor body.
 
> The system simply initializes vectors with the value supplied by the
> user. If it did it with a special implicit value, the lack of
> efficiency would be the same.

	Accepted.
 
> >       I do not care that efficiency is lost while debugging,
> > and the spurious test can be turned off with the -unsafe option
> > just like bounds checking.
> 
> I do not understand: the efficiency is lost anyway as soon as the
> vectors have to be initialized, since this initialization is an O(n)
> operation. 

	The issue here isn't just initialisation, but also testing
for the special 'initialised' value. If I use 'a option,
I must write code (ugly) to match the case which always
occurs, and presumably that takes time:

	match x.(i) with Some x' -> x' | None -> assert false

is surely slower (and unquestionably uglier) than

	x.(i) 

where the latter throws if x.(i) is uninitialised OR past the end of the
array.
I can turn off the bound checking with -unsafe, and if I could turn off
the
uninitialised checking as well, then the cost of an access would be the
same as for a fixed length array.

>Also if you don't perform the test at array access, you
> will have a very difficult time to debug the program, since the
> ``null'' value could be a valid value of the type stored into the
> array, and the program may crash long time after the erroneous access
> or even worse continue with this erroneous value and give irrelevant
> results ...

	The idea is that the system check each access, but the checks
be elided if -unsafe is specified. The result is cleaner than either
using a dummy value (which is actually unsafe, since reading the dumyy
_doesn't_ throw an exception when it should), or using 'a option.

 
> You're certainly right. However you did not give the benchmark that
> proves that Caml lists are way too inefficient to implement the Python
> lists! Can you please give us the specification you need, that is, the
> kind of primitives you want to implement efficiently, and the runtime
> figures you got with the Python implementation (i.e. a benchmark) ?

	Sure. but note, it is not necessary to give a benchmark:
python lists support constant time random access, ocaml ones
are O(n), and that is enough. Indexing is common. Concatenation is
also common and must be fast. Here are the results from an appending
test:


Viper (ocaml) 2.0.1 
Concatenate python list (ocaml Varray) 317  secs 
Concatenate python tuple (ocaml list) 794  secs 

Python (C)
Concatenate python list (ocaml Varray) 54.33  secs
Concatenate python tuple (ocaml list) 48.98  secs

The test code is:
-----------------------------
from time import clock
from sys import version
print version

loops = 10000
x0 = [1,2,3,4]
y0 = (1,2,3,4)

x = []
y = ()

start = clock()
for i in xrange(loops):
  x = x + x0
print 'Concatenate python list (ocaml Varray)', clock() - start,' secs'

start = clock()
for i in xrange(loops):
  y = y + y0
print 'Concatenate python tuple (ocaml list)', clock() - start, ' secs'
------------------------------------

> lists. (Your list datatype should look like:
> 
> type 'a plist =
>    | Empty
>    | Cons of 'a evect;;
> 
> isn't it ?)

	No, I just use a variable length array:

	type varray_t = { used: int; data: 'a array }

 
> > I think my point is this: when building a library module,
> > it is useful to have extra code to handle the magic
> > initial value, to find bugs. When the library is thought
> > to be correct, it can be recompiled with the -unsafe option.
> > The result is then fast, although it may crash the program
> > if there is still a bug -- but if there is a bug, the program
> > is bugged _anyhow_, and all that is lost is some extra
> > diagnostic information, which can sometimes be recovered
> > by rerunning the program on the same data, but with the
> > program re-built with the debugging version of the library.
> 
> There is a philosophical problem here, since one of the fundamental
> idea of Caml-like languages is just to
>  1) prevents runtime errors as much as possible via static analyses.
>  2) give you some meaningful diagnostic in case of error (something
> different from ``bus error'')

	I agree with this philosophy, but would temper it with the
condition that not too much efficiency is lost, and that not too
much expressiveness is lost.
 
> We very often chose to jeopardise efficiency in favor of safety and good error
> diagnostic. For instance, we considered several times to remove the
> -unsafe option. If we had implemented a good array bound checkers, we
> certainly had removed it ...

	Yes, and I am sure each time you make this kind of compromise
you are asking for some research to show a way to get safety
without loss of expressiveness or efficiency.
 
> > The point is that not much safety is lost, since only that
> > module is compiled with -unsafe, and only when it is
> > well enough tested and analysed to have reasonable confidence
> > in its correctness.
> 
> You're right. That's always what we claim to the clients of our
> programs: ``it is well enough tested and analysed to have reasonable
> confidence in its correctness.'' Unfortunately, we also have bug
> reports ...

	This will be the case until better languages are 
designed where the compiler can prove correctness in more cases.
It seems to be a good way forward is to consider functors
as a programming technique, since a functor instantiating
an instance of an abstraction guarrantees the instance
is a correct instance of the abstraction.
 
> You're right. We had often the same kind of discussions and the same
> kind of arguments with people from other programming groups (C, Perl,
> Scheme, Ada, ...): very often they replace ``uninitialised arrays'' by
> ``strongly typed'' or ``pattern matching exaustivity''; and have
> exactly the same arguments ``sometime inappropriate'', ``_disservice_
> to the programmer'', ``core dump easier to trace'', ``simpler code if
> all those lifegards are omitted'', ``I don't need those static checks
> since I careful test and analyze my code''.
> 
> Once more, it is a matter of philosophical opinions: in Caml the
> language ensures some invariants. This is a strong and convenient
> property, that you may like or dislike. And unfortunately this
> property has a price ...

	I believe I share much the same philosophy. But when I
see unnatural code, or inefficiency, imposed by the current
implementation consequences of the philosophy, I question
both the philosophy and the implementation consquences.

	In this case I ask, if a well principled solution
is not to be found from the underlying category theory,
which would not change the philosophy.

	In this particular case, it would probably be correct to
permit a value to be temporarily uninitialised, provided
the compiler could prove it was initialised before use.

	Since this isn't always the case, the question
is whether to ban the construction, or resort to a run time
error. In the similar case of array bounds, ocaml takes
the latter approach. Therefore, doing so with uninitialised
values is within the current philosophy.

	So I think it is simply a question of judgement of utility.
The only opinion I offer on that is that I have ONE case where
there would be some benefit.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-09 23:52     ` Pierre Weis
  1999-10-10  8:14       ` skaller
@ 1999-10-10 12:56       ` Michel Quercia
  1999-10-10 21:16         ` Pierre Weis
  1 sibling, 1 reply; 10+ messages in thread
From: Michel Quercia @ 1999-10-10 12:56 UTC (permalink / raw)
  To: Pierre Weis

Le sam, 09 oct 1999, vous avez écrit :

: For instance:
: 
: -- (any : int) was 0
: -- (any : string) was ""
: ...
: -- (any : t) where t was a sum type was the first constant constuctor if
: any ot the first functional constructor C of ty with (any : ty) as
: argument. (Hence (any : string list) was [].)
: -- (any : t) where t was a record type was the list of field with the
: value (any : ty) in each field (ty according to the label argument).-- (any : t1 -> t2) was (fun _ -> raise Bottom)
: 
: This is well-typed and semantically sound.

What about : "type t = T of int*t" (immutable circular int lists) ?
Well there is a solution : "let rec dummy = T(0,dummy)" but in more complicated
recursive type definitions the compiler will have to work hard to produce some
plausible value, it seems to me that this should always be possible, am I right
?

Concerning varying arrays,
--------------------------
I had an idea not far from Markus Mottl's own.
Define a new tag, let's call it Vtag flagging varying size (that is both growing
and shrinking) objects. A Vobject would say :

"Hello everybody, I'm a varying size object, I have TWO lengths, an allocation
length stored in my header and an actual length stored in my first field. And
you can pretty trust in me, the actual length never greater that the allocation
one will be.

Mr. GC, would be so kind as to not put your nose past my actual length ? There
is nothing interresting for you there.

Mr. compiler, please check index bounds against my actual length, and shift
actual index by one in order to retrieve the indexed value."

Now the need for uninitialized values disappears as well as the need to erase
pointers when shrinking a Varray ... Nice, but I have some doubts about O'caml
implementors accepting my proposal.

Concerning Lists implemented by varying arrays,
-----------------------------------------------
An awfull question pops up inside of me, how can one implement :

let l1 = a :: l and l2 = b :: l

if lists are actually arrays (with head at the end) without copying l ?

--
Michel Quercia
9/11 rue du grand rabbin Haguenauer, 54000 Nancy
http://pauillac.inria.fr/~quercia
mailto:quercia@cal.enst.fr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-10 12:56       ` Michel Quercia
@ 1999-10-10 21:16         ` Pierre Weis
  1999-10-11  5:37           ` Lyn A Headley
  0 siblings, 1 reply; 10+ messages in thread
From: Pierre Weis @ 1999-10-10 21:16 UTC (permalink / raw)
  To: quercia; +Cc: caml-list

> : any of the first functional constructor C of ty with (any : ty) as
> : argument. (Hence (any : string list) was [].)
[...]

> What about : "type t = T of int*t" (immutable circular int lists) ?
> Well there is a solution : "let rec dummy = T(0,dummy)" but in more
> complicated recursive type definitions the compiler will have to
> work hard to produce some plausible value, it seems to me that this
> should always be possible, am I right ?

This is always possible, and this was done properly via mutually
recursive definition of identifiers (let rec t0 = T (0, t0) in the
case of type t for instance). The real problems arrive when any is
encapsulated into polymorphic functions (for instance, ``any'' cannot
help to initialize a ``statically polymorphic'' array, that is an array
whose elements are statically completely unknown...). Then you must
extend the type-system with ``generic polymorphism'' to get types at
run-time (see the article "Extensional polymorphism" by Catherine
Dubois Francois Rouaix and Pierre Weis, POPL'95).

> Concerning varying arrays,
> --------------------------
> I had an idea not far from Markus Mottl's own.
> Define a new tag, let's call it Vtag flagging varying size (that is
> both growing and shrinking) objects. A Vobject would say :
[...]
> to erase pointers when shrinking a Varray ... Nice, but I have some
> doubts about O'caml implementors accepting my proposal.

Nice and simple enough anyway. Worth a try, if we are sure that ``growing
and shrinking objects'' are indeed reasonably general to be worth the
implementation effort (ask mister GC for a new tag, ask mister GC for
an addition to the mark and copy phases, decrement the maximum number
of constructors in sum types)...

> Concerning Lists implemented by varying arrays,
> -----------------------------------------------
> An awfull question pops up inside of me, how can one implement :
> 
> let l1 = a :: l and l2 = b :: l
> 
> if lists are actually arrays (with head at the end) without copying l ?

This question is difficult, I don't know how to obtain the same
sharing as usual lists as soon as vectors are used. For instance, I
don't know how to have a cons in constant time, since the vector may
overflow, that means that you have to extend it and copy all the
elements of the ``list'' in the fresh vector. Furthermore, if your
lists share sub lists it becomes extremely difficult not to keep some
pointer to the old vector, thus genrating lots of memory leaks.

More generally, concerning the lists John Skaller tries to implement,
we cannot really understand his problems as soon as we don't know what
is the complexity he needs for all the basic functions on his lists:

For O'Caml we have:
hd : O(1), tl : O(1), cons : O(1), nth l n : O(n)
append l1 l2 : O(len (l1))
append_lists [l1; l2; ... ln] : O(len (l1) + len (l2) + ... + len (ln-1))

What are the figures for Python lists ?
Are there other significant function for the list package ?

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-10 21:16         ` Pierre Weis
@ 1999-10-11  5:37           ` Lyn A Headley
  1999-10-12 18:52             ` skaller
  0 siblings, 1 reply; 10+ messages in thread
From: Lyn A Headley @ 1999-10-11  5:37 UTC (permalink / raw)
  To: Pierre Weis; +Cc: caml-list

>>>>> "Pierre" == Pierre Weis <Pierre.Weis@inria.fr> writes:

    Pierre> For O'Caml we have: hd : O(1), tl : O(1), cons : O(1), nth
    Pierre> l n : O(n) append l1 l2 : O(len (l1)) append_lists [l1;
    Pierre> l2; ... ln] : O(len (l1) + len (l2) + ... + len (ln-1))

    Pierre> What are the figures for Python lists ?  Are there other
    Pierre> significant function for the list package ?


Python "lists" are based on arrays, so I believe the performance is as
follows (some of these functions don't really exists, but would be
"faked" using slices or somesuch):

hd: O(1) 
tl: O(n - 1) 
cons(a, b): O(len(b) + 1)
nth(l, n): O(1)
append(l1 l2): O(len(l1) + len(l2))
append_lists([l1, l2, ... ln]) not sure if this exists.  The naive 
implementation using "+" would be quadratic.

-Lyn





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Data structures in ocaml
  1999-10-11  5:37           ` Lyn A Headley
@ 1999-10-12 18:52             ` skaller
  0 siblings, 0 replies; 10+ messages in thread
From: skaller @ 1999-10-12 18:52 UTC (permalink / raw)
  To: Lyn A Headley; +Cc: Pierre Weis, caml-list

Lyn A Headley wrote:

> Python "lists" are based on arrays, so I believe the performance is as
> follows (some of these functions don't really exists, but would be
> "faked" using slices or somesuch):

> append_lists([l1, l2, ... ln]) not sure if this exists.  The naive
> implementation using "+" would be quadratic.

That's an interesting observation: thanks for pointing it
out. In fact, expressions like

	x = a + b + c + d

are common manipulating strings in Python. Because I represent this
as a list, _not_ recursively, I can get linear performance,
if there is a way to preallocate the storage. [I can't remember
off hand if the 'Buffer' class allows this]. I can also do it 
for (python) lists, using Varray.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~1999-10-13  6:59 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-10-06 18:50 Data structures in ocaml skaller
1999-10-07 12:10 ` Pierre Weis
1999-10-08 18:05   ` skaller
1999-10-09 23:17     ` Markus Mottl
1999-10-09 23:52     ` Pierre Weis
1999-10-10  8:14       ` skaller
1999-10-10 12:56       ` Michel Quercia
1999-10-10 21:16         ` Pierre Weis
1999-10-11  5:37           ` Lyn A Headley
1999-10-12 18:52             ` skaller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).